Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 16:54 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505031104080.14033@localhost.localdomain>

On Tuesday 03 May 2005 11:04, Nicolas Pitre wrote:
> On Tue, 3 May 2005, Chris Mason wrote:

> > coffee:~/git/linus.orig # echo foo > foo
> > coffee:~/git/linus.orig # echo foo2 > foo2
> > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > coffee:~/git/linus.orig # ls -la delta1
> > -rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
> > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
>
> OK, doh!

Thanks, this one works ;)  I'll kick off a run with this replacing zdelta, 
should be around 3 hours.  For my small tree run with 300 patches, its faster 
than zdelta with about the same space savings.

-chris

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 16:35 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.61.0505031153550.32767@cag.csail.mit.edu>

On Tuesday 03 May 2005 11:57, C. Scott Ananian wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > your delta generator later this week.  Some quick and dirty space numbers
> > to show why we need to pack the files together:
>
> Are you accurately accounting for the cost of the extra hard/soft links
> your scheme requires?  Ie the directories get larger, lookups take
> slightly longer, etc.  Also access to a given file takes longer, and the
> deltas are referring to *other* packed files which *also* take longer to
> decompress and access...

My patch doesn't create any extra directory entries because the file for the 
packed file is unlinked after all the hard links are made.  Even if I kept 
the packed file directory entry, I'd adding one directory entry and saving an 
average 6-7 inodes per commit.

>
> How much better does delta-fication do, compared to just packing?

The best case for just packing is to pack the blobs, trees and commits all 
into one object.  Doing all three brought the tree down from 2.5GB to 1.57GB.  

The delta patch does pack trees together, but not into the same file as the 
blobs, and commits are not packed at all.  This is just because it is a pain 
to carry those changes around; it'll be easy to do later.

With the delta patch, the tree is around 900MB, I estimate packing the commits 
and trees into the blob files would save another 200MB.

Because space savings is so tightly coupled with packing ratios, a script to 
repack blobs, trees and commits from multiple commits will give much better 
compression.  Right now  the patch does not delta trees or commits, but it 
might make sense to delta the trees via the repacking script.

-chris

^ permalink raw reply

* Re: questions about cg-update, cg-pull, and cg-clone.
From: Daniel Barkalow @ 2005-05-03 16:30 UTC (permalink / raw)
  To: Zack Brown; +Cc: Petr Baudis, Git Mailing List
In-Reply-To: <20050503152214.GA1704@tumblerings.org>

On Tue, 3 May 2005, Zack Brown wrote:

> So, suppose I'm working on your Cogito HEAD. I make some changes to my local
> tree and commit them to my tree, and then before I go forward, I want to grab
> whatever you've done recently, to make sure we're not in conflict before I add
> new changes. If I understand you right, this situation would be a 'fast forward
> merge'. So what is the command I give to just 'merge' your HEAD with mine,
> without requiring a changelog entry?

In this case, you have to do a tree merge, because you have some commits
and he has some commits, and you want to be in a state where you have your
commits and his; this state is new, so you need a new commit with both
lines as parents.

> Alternatively, suppose I'm you, the project lead, and Zackdude has some
> changes for me, based on my HEAD. I want to 'merge' his tree into mine. If
> I'm still understanding you, this is a 'tree merge'. Now I give a cg-update,
> and now I *want* to give a changelog entry to record the merge.  Correct?

In this case, you don't have any commits that the other guy doesn't
have. Zackdude took your tree, made some changes, and that's his
head. Your head is still the same. He's already specified what happens
when you go from your head to his head; that's what he did, so the answer
has to be his head. That's a fast-forward.

Now, if the project lead decided to update from a second contributor who
hadn't rebased their contribution on the new head, then a merge is
required, to resolve the potential conflicts, and this merge needs a
commit.

> No, I still don't see it. I don't see why I would want to add an additional
> changelog entry on top of whatever changelog entries Zackdude has made himself.
> It just seems to pollute the changelog with entries that are essentially
> meaningless. When I read back over the logs, I'm not going to be interested in
> the bookkeeping of when I merged with various developers, I'm going to be
> interested in what those developers actually did to the code, and what *I*
> actually did to the code.

If developer A's changes work, and developer B's changes work, but they
don't work in your merge of them, you'll want to see that. Furthermore,
without a commit with both of their commits as parents, you can't reach
both of their histories from anywhere.

> OK, I don't understand this either. What is the difference between fetching the
> stuff and merging the stuff? Suppose I am working on a local repo of Cogito
> HEAD. I make some changes, commit them, and then I do a cg-pull. What happens?
> Are my changes overwritten? Do they show up at all? Do they exist in some
> nebulous ether that I will never see until I do a merge?

If you do a "cg-pull pasky", this doesn't change any of your stuff, but it
means that "cg-diff -r pasky" will now compare against his new head,
rather than the head he had when you previously did stuff. "cg-log
pasky" will include the new messages, and so forth. Also, you can then do
the merge without a network connection; you can pull overnight and merge
on the train.

You don't see anything different in your working directory, but your
repository essentially "knows more".

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply

* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Bill Davidsen @ 2005-05-03 16:22 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>,
	Linus Torvalds, Ryan Anderson, Andrea Arcangeli, linux-kernel,
	git
In-Reply-To: <20050503012921.GD22038@waste.org>

Matt Mackall wrote:
> On Tue, May 03, 2005 at 03:16:26AM +0200, Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org> wrote:
> 
>>Linus Torvalds <torvalds@osdl.org> wrote:
>>
>>>On Mon, 2 May 2005, Ryan Anderson wrote:
>>>
>>>>On Mon, May 02, 2005 at 09:31:06AM -0700, Linus Torvalds wrote:
>>
>>>>>That said, I think the /usr/bin/env trick is stupid too. It may be more
>>>>>portable for various Linux distributions, but if you want _true_
>>>>>portability, you use /bin/sh, and you do something like
>>>>>
>>>>>#!/bin/sh
>>>>>exec perl perlscript.pl "$@"
>>>>
>>>>if 0;
>>
>>exec may fail.
>>
>>#!/bin/sh
>>exec perl -x $0 ${1+"$@"} || exit 127
>>#!perl
>>
>>
>>>>You don't really want Perl to get itself into an exec loop.
>>>
>>>This would _not_ be "perlscript.pl" itself. This is the shell-script, and
>>>it's not called ".pl".
>>
>>In this thread, it originally was.
> 
> 
> In this thread, it was originally a Python script. In particular, one
> aimed at managing the Linux kernel source. I'm going to use
> /usr/bin/env, systems where that doesn't exist can edit the source.

On the theory that my first post got lost, why use /usr/bin/env at all, 
when bash already does that substitution? To support people who use 
other shells?

ie.:
    FOO=xx perl -e '$a=$ENV{FOO}; print "$a\n"'
-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 16:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505030804170.3594@ppc970.osdl.org>

On Tuesday 03 May 2005 11:07, Linus Torvalds wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > On the full import of all the bk->cvs changesets, the average file size
> > in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.
>
> Have you checked how many of those are blobs?
>
I've got cg-admin-lsobj running (effectively find .git -type f | xargs 
cat-file), it is taking a looong time but the ratios seem to stay pretty 
constant as it makes progress:

total: 186863
blob: 93688     (6.6 per commit)
commit: 14172
tree: 79003      (5.5 per commit)

> For many commits, we generate as many (or more) _tree_ objects as we
> generate blobs.
>
> And tree obejcts from the same "supertree" really is something that I
> wouldn't mind packing some way, because they really tend to be very much
> related (since they refer to each other). Eg the commit and the top-level
> tree are almost always a pair, since you'd get a shared top-level tree
> only with two commits that have the exact same content (which definitely
> happens, don't get me wrong, but it we get some duplication for that case,
> we'd still be winning).
>

The packed item patch wouldn't duplicate info in this case.  When it initially 
creates the packed buffer (before compression), it checks for an existing 
file with the same sha1 and returns if one is found.  This is to preserve the 
optimizations for write_tree case where it frequently tries to create files 
that already exist.

-chris

^ permalink raw reply

* Re: questions about cg-update, cg-pull, and cg-clone.
From: Joel Becker @ 2005-05-03 15:59 UTC (permalink / raw)
  To: Zack Brown; +Cc: Petr Baudis, Git Mailing List
In-Reply-To: <20050503152214.GA1704@tumblerings.org>

On Tue, May 03, 2005 at 08:22:15AM -0700, Zack Brown wrote:
> So, suppose I'm working on your Cogito HEAD. I make some changes to my local
> tree and commit them to my tree, and then before I go forward, I want to grab
> whatever you've done recently, to make sure we're not in conflict before I add
> new changes. If I understand you right, this situation would be a 'fast forward
> merge'. So what is the command I give to just 'merge' your HEAD with mine,
> without requiring a changelog entry?

	Remember that HEAD is merely a SHA1 of the toplevel tree object.
Imagine you have the simplest tree, one directory containing one file.
The file has the has hash aaaaaa.  The tree object containing it has the
hash bbbbbb.  So, HEAD contains bbbbbb.
	Now you update from Petr, having made no changes.  You pull his
newest tree, which also has a new file.  That new file has the hash
cccccc.  The new tree object, containing both files, now has the hash
dddddd.  HEAD now contains dddddd.  As you are in a matching state to
his tree, you have not done anything interesting to your tree, and there
is no commit.  This is a "fast-forward" merge.
	Then you change the first file, adding a few functions.  You
commit it, and it now has the hash 111111.  This change means the tree
hash becomes 222222.  So, HEAD contains 222222.
	You then update from Petr again.  He's changed the second file.
It's hash is no longer cccccc, it's eeeeee.  In his tree, the hash of
the tree is 333333 (from file 1's aaaaaa and file 2's eeeeee).  But the
hash of your tree is 444444 (from your local file 1's 111111 and file 2's eeeeee).  So, the hash of the your tree becomes 444444.  Your HEAD contains 444444.
This does _not_ match his 333333 HEAD.  You are committing the
combination of his change and yours.  He is saying that this work, which
may have required hand-merging or commit resolution, is "interesting"
information.

Joel

-- 

Life's Little Instruction Book #69

	"Whistle"

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: C. Scott Ananian @ 2005-05-03 15:57 UTC (permalink / raw)
  To: Chris Mason; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505030724.57827.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> your delta generator later this week.  Some quick and dirty space numbers to
> show why we need to pack the files together:

Are you accurately accounting for the cost of the extra hard/soft links 
your scheme requires?  Ie the directories get larger, lookups take 
slightly longer, etc.  Also access to a given file takes longer, and the 
deltas are referring to *other* packed files which *also* take longer to 
decompress and access...

How much better does delta-fication do, compared to just packing?
  --scott

NSA FJDEFLECT radar WASHTUB justice LCFLUTTER KUCLUB PBHISTORY Ft. Bragg 
ammunition immediate ESMERALDITE DC terrorist C4 SLBM affinity group
                          ( http://cscott.net/ )

^ permalink raw reply

* Re: RFC: adding xdelta compression to git
From: C. Scott Ananian @ 2005-05-03 15:52 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505022215110.21733@bigblue.dev.mdolabs.com>

On Mon, 2 May 2005, Davide Libenzi wrote:

> On Mon, 2 May 2005, Linus Torvalds wrote:
>
>> Yes. EXCEPT for one thing. fsck. I'd _really_ like fsck to be able to know
>> something about any xdelta objects, if only because if/when things go

> Linus, xdelta-based algorithms already stores informations regarding the
> object that originated the diff. Since they have no context (like
> text-based diffs) and are simply based on offset-driven copy/insert
> operations, this is a requirement. Libxdiff uses an adler32+size of the
> original object, but you can get as fancy as you like in your own
> implementation. Before a delta patching, the stored information are cross
> checked with the input base object, and the delta patch will fail in the
> eventuality of mismatch. So an fsck is simply a walk backward (or forward,
> depending on your metadata model) of the whole delta chain.

Linus knows this.  His point is just to be sure you actually *code* that 
walk in fsck, and (hopefully) do so w/o complicating the fsck too much.
  --scott

supercomputer BOND quiche SYNCARP Honduras North Korea Qaddafi PANCHO 
SKILLET KUDESK non-violent protest ESQUIRE struggle Saddam Hussein
                          ( http://cscott.net/ )

^ permalink raw reply

* [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 15:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.58.0505030742330.3594@ppc970.osdl.org>

On Tue, 3 May 2005, Linus Torvalds wrote:

> On Tue, 3 May 2005, Nicolas Pitre wrote:
> > 
> > Yep, that's what I've done last weekend (and just made it actually 
> > work since people are getting interested).
> 
> I have to say that it looks uncommonly simple. Also, afaik, this should
> still work with the current fsck, it's just that because fsck doesn't
> understand the linkages, the error reporting won't be as good as it could
> be (I'd _much_ rather see "delta failed in object xxxxx" than "unable to
> read xxxxxx").

Yep.  Let's do it in a separate patch if you please.

> Now, one thing I like about this approach is that the actual delta 
> _generation_ can be done off-line, and independently of anything else. 
> Which means that the performance paths I care about (commit etc) are 
> largely unaffected, and you can "deltify" a git archive overnight or 
> something. 

Yes.  And actually you can use any kind of delta reference topology as 
you wish.  It may start from the first object revision and the next 
revision is a delta against the first, the third a delta against the 
second, etc.  But it is much more interesting to do it the other way 
around, such that the second revision is stored as is and the first 
revision is made a delta against the second revision.  Then on the next 
commit the third revision is stored as is and the second rev made a 
delta against the third, and so on.  You therefore get delta compression 
at commit time with little overhead if you wish to do that.  And this 
approach has the advantage of keeping the latest object revisions fast 
accessible and the delta overhead is relegated to the old historic 
objects.

And suppose the delta chain is too deep for some objects and accessing 
them gets too much overhead.  No problem: just pick a random object in 
the middle of the delta chain and swap it with its original undeltafied 
version and the delta chain is now cut in two.

Etc.  It's flexible and open to any arrangement.

OK, here's a revised patch correcting the little bug found by
Chris Mason.

==========

This patch adds the necessary functionalities to perform delta
compression on objects.  It adds a git-mkdelta command which can replace
any object with its deltafied version given a reference object.

Access to a delta object will transparently fetch the reference object
and apply the transformation.  Scripts can be used to perform any sort
of compression policy on top of it.

The delta generator has been extracted from libxdiff and optimized for
git usage in order to avoid as much data copy as possible, and the delta
storage format modified to be even more compact.  Therefore no need to
rely on any external library.  The test-delta program can be used to
test it.

Many refinements are needed but better merge them separately.  Loop 
detection and recursion treshold are a few examples.

Signed-off-by: Nicolas Pitre <nico@cam.org>

--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,7 @@ install: $(PROG) $(SCRIPTS)
 	install $(PROG) $(SCRIPTS) $(HOME)/bin/
 
 LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
-	 tag.o date.o
+	 tag.o date.o diff-delta.o patch-delta.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h blob.h tree.h commit.h tag.h
 
@@ -63,6 +63,9 @@ $(LIB_FILE): $(LIB_OBJS)
 test-date: test-date.c date.o
 	$(CC) $(CFLAGS) -o $@ test-date.c date.o
 
+test-delta: test-delta.c diff-delta.o patch-delta.o
+	$(CC) $(CFLAGS) -o $@ $^
+
 git-%: %.c $(LIB_FILE)
 	$(CC) $(CFLAGS) -o $@ $(filter %.c,$^) $(LIBS)
 
@@ -92,6 +95,7 @@ git-rpush: rsh.c
 git-rpull: rsh.c pull.c
 git-rev-list: rev-list.c
 git-mktag: mktag.c
+git-mkdelta: mkdelta.c
 git-diff-tree-helper: diff-tree-helper.c
 git-tar-tree: tar-tree.c
 git-write-blob: write-blob.c
Created: delta.h (mode:100644)
--- /dev/null
+++ b/delta.h
@@ -0,0 +1,6 @@
+extern void *diff_delta(void *from_buf, unsigned long from_size,
+			void *to_buf, unsigned long to_size,
+		        unsigned long *delta_size);
+extern void *patch_delta(void *src_buf, unsigned long src_size,
+			 void *delta_buf, unsigned long delta_size,
+			 unsigned long *dst_size);
Created: diff-delta.c (mode:100644)
--- /dev/null
+++ b/diff-delta.c
@@ -0,0 +1,315 @@
+/*
+ * diff-delta.c: generate a delta between two buffers
+ *
+ *  Many parts of this file have been lifted from LibXDiff version 0.10.
+ *  http://www.xmailserver.org/xdiff-lib.html
+ *
+ *  LibXDiff was written by Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2003	Davide Libenzi
+ *
+ *  Many mods for GIT usage by Nicolas Pitre <nico@cam.org>, (C) 2005.
+ *
+ *  This file is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ */
+
+#include <stdlib.h>
+#include "delta.h"
+
+
+/* block size: min = 16, max = 64k, power of 2 */
+#define BLK_SIZE 16
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define GR_PRIME 0x9e370001
+#define HASH(v, b) (((unsigned int)(v) * GR_PRIME) >> (32 - (b)))
+	
+/* largest prime smaller than 65536 */
+#define BASE 65521
+
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+#define NMAX 5552
+
+#define DO1(buf, i)  { s1 += buf[i]; s2 += s1; }
+#define DO2(buf, i)  DO1(buf, i); DO1(buf, i + 1);
+#define DO4(buf, i)  DO2(buf, i); DO2(buf, i + 2);
+#define DO8(buf, i)  DO4(buf, i); DO4(buf, i + 4);
+#define DO16(buf)    DO8(buf, 0); DO8(buf, 8);
+
+static unsigned int adler32(unsigned int adler, const unsigned char *buf, int len)
+{
+	int k;
+	unsigned int s1 = adler & 0xffff;
+	unsigned int s2 = adler >> 16;
+
+	while (len > 0) {
+		k = MIN(len, NMAX);
+		len -= k;
+		while (k >= 16) {
+			DO16(buf);
+			buf += 16;
+			k -= 16;
+		}
+		if (k != 0)
+			do {
+				s1 += *buf++;
+				s2 += s1;
+			} while (--k);
+		s1 %= BASE;
+		s2 %= BASE;
+	}
+
+	return (s2 << 16) | s1;
+}
+
+static unsigned int hashbits(unsigned int size)
+{
+	unsigned int val = 1, bits = 0;
+	while (val < size && bits < 32) {
+		val <<= 1;
+	       	bits++;
+	}
+	return bits ? bits: 1;
+}
+
+typedef struct s_chanode {
+	struct s_chanode *next;
+	int icurr;
+} chanode_t;
+
+typedef struct s_chastore {
+	chanode_t *head, *tail;
+	int isize, nsize;
+	chanode_t *ancur;
+	chanode_t *sncur;
+	int scurr;
+} chastore_t;
+
+static void cha_init(chastore_t *cha, int isize, int icount)
+{
+	cha->head = cha->tail = NULL;
+	cha->isize = isize;
+	cha->nsize = icount * isize;
+	cha->ancur = cha->sncur = NULL;
+	cha->scurr = 0;
+}
+
+static void *cha_alloc(chastore_t *cha)
+{
+	chanode_t *ancur;
+	void *data;
+
+	ancur = cha->ancur;
+	if (!ancur || ancur->icurr == cha->nsize) {
+		ancur = malloc(sizeof(chanode_t) + cha->nsize);
+		if (!ancur)
+			return NULL;
+		ancur->icurr = 0;
+		ancur->next = NULL;
+		if (cha->tail)
+			cha->tail->next = ancur;
+		if (!cha->head)
+			cha->head = ancur;
+		cha->tail = ancur;
+		cha->ancur = ancur;
+	}
+
+	data = (void *)ancur + sizeof(chanode_t) + ancur->icurr;
+	ancur->icurr += cha->isize;
+	return data;
+}
+
+static void cha_free(chastore_t *cha)
+{
+	chanode_t *cur = cha->head;
+	while (cur) {
+		chanode_t *tmp = cur;
+		cur = cur->next;
+		free(tmp);
+	}
+}
+
+typedef struct s_bdrecord {
+	struct s_bdrecord *next;
+	unsigned int fp;
+	const unsigned char *ptr;
+} bdrecord_t;
+
+typedef struct s_bdfile {
+	const unsigned char *data, *top;
+	chastore_t cha;
+	unsigned int fphbits;
+	bdrecord_t **fphash;
+} bdfile_t;
+
+static int delta_prepare(const unsigned char *buf, int bufsize, bdfile_t *bdf)
+{
+	unsigned int fphbits;
+	int i, hsize;
+	const unsigned char *base, *data, *top;
+	bdrecord_t *brec;
+	bdrecord_t **fphash;
+
+	fphbits = hashbits(bufsize / BLK_SIZE + 1);
+	hsize = 1 << fphbits;
+	fphash = malloc(hsize * sizeof(bdrecord_t *));
+	if (!fphash)
+		return -1;
+	for (i = 0; i < hsize; i++)
+		fphash[i] = NULL;
+	cha_init(&bdf->cha, sizeof(bdrecord_t), hsize / 4 + 1);
+
+	bdf->data = data = base = buf;
+	bdf->top = top = buf + bufsize;
+	data += (bufsize / BLK_SIZE) * BLK_SIZE;
+	if (data == top)
+		data -= BLK_SIZE;
+
+	for ( ; data >= base; data -= BLK_SIZE) {
+		brec = cha_alloc(&bdf->cha);
+		if (!brec) {
+			cha_free(&bdf->cha);
+			free(fphash);
+			return -1;
+		}
+		brec->fp = adler32(0, data, MIN(BLK_SIZE, top - data));
+		brec->ptr = data;
+		i = HASH(brec->fp, fphbits);
+		brec->next = fphash[i];
+		fphash[i] = brec;
+	}
+
+	bdf->fphbits = fphbits;
+	bdf->fphash = fphash;
+
+	return 0;
+}
+
+static void delta_cleanup(bdfile_t *bdf)
+{
+	free(bdf->fphash);
+	cha_free(&bdf->cha);
+}
+
+#define COPYOP_SIZE(o, s) \
+    (!!(o & 0xff) + !!(o & 0xff00) + !!(o & 0xff0000) + !!(o & 0xff000000) + \
+     !!(s & 0xff) + !!(s & 0xff00) + 1)
+
+void *diff_delta(void *from_buf, unsigned long from_size,
+		 void *to_buf, unsigned long to_size,
+		 unsigned long *delta_size)
+{
+	int i, outpos, outsize, inscnt, csize, msize, moff;
+	unsigned int fp;
+	const unsigned char *data, *top, *ptr1, *ptr2;
+	unsigned char *out, *orig;
+	bdrecord_t *brec;
+	bdfile_t bdf;
+
+	if (delta_prepare(from_buf, from_size, &bdf))
+		return NULL;
+	
+	outpos = 0;
+	outsize = 4096;
+	out = malloc(outsize);
+	if (!out) {
+		delta_cleanup(&bdf);
+		return NULL;
+	}
+
+	data = to_buf;
+	top = to_buf + to_size;
+
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size;
+
+	inscnt = 0;
+	moff = 0;
+	while (data < top) {
+		msize = 0;
+		fp = adler32(0, data, MIN(top - data, BLK_SIZE));
+		i = HASH(fp, bdf.fphbits);
+		for (brec = bdf.fphash[i]; brec; brec = brec->next) {
+			if (brec->fp == fp) {
+				csize = bdf.top - brec->ptr;
+				if (csize > top - data)
+					csize = top - data;
+				for (ptr1 = brec->ptr, ptr2 = data; 
+				     csize && *ptr1 == *ptr2;
+				     csize--, ptr1++, ptr2++);
+
+				csize = ptr1 - brec->ptr;
+				if (csize > msize) {
+					moff = brec->ptr - bdf.data;
+					msize = csize;
+					if (msize >= 0x10000) {
+						msize = 0x10000;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!msize || msize < COPYOP_SIZE(moff, msize)) {
+			if (!inscnt)
+				outpos++;
+			out[outpos++] = *data++;
+			inscnt++;
+			if (inscnt == 0x7f) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+		} else {
+			if (inscnt) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+
+			data += msize;
+			orig = out + outpos++;
+			i = 0x80;
+
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
+			msize >>= 8;
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
+
+			*orig = i;
+		}
+
+		/* next time around the largest possible output is 1 + 4 + 3 */
+		if (outpos > outsize - 8) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			out = realloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				delta_cleanup(&bdf);
+				return NULL;
+			}
+		}
+	}
+
+	if (inscnt)
+		out[outpos - inscnt - 1] = inscnt;
+
+	delta_cleanup(&bdf);
+	*delta_size = outpos;
+	return out;
+}
Created: patch-delta.c (mode:100644)
--- /dev/null
+++ b/patch-delta.c
@@ -0,0 +1,73 @@
+/*
+ * patch-delta.c:
+ * recreate a buffer from a source and the delta produced by diff-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+#include "delta.h"
+
+void *patch_delta(void *src_buf, unsigned long src_size,
+		  void *delta_buf, unsigned long delta_size,
+		  unsigned long *dst_size)
+{
+	const unsigned char *data, *top;
+	unsigned char *dst, *out;
+	int size;
+
+	/* the smallest delta size possible is 10 bytes */
+	if (delta_size < 10)
+		return NULL;
+
+	data = delta_buf;
+	top = delta_buf + delta_size;
+
+	/* make sure the orig file size matches what we expect */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	if (size != src_size)
+		return NULL;
+
+	/* now the result size */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	dst = malloc(size);
+	if (!dst)
+		return NULL;
+
+	out = dst;
+	while (data < top) {
+		unsigned char cmd = *data++;
+		if (cmd & 0x80) {
+			unsigned int cp_off = 0, cp_size = 0;
+			if (cmd & 0x01) cp_off = *data++;
+			if (cmd & 0x02) cp_off |= (*data++ << 8);
+			if (cmd & 0x04) cp_off |= (*data++ << 16);
+			if (cmd & 0x08) cp_off |= (*data++ << 24);
+			if (cmd & 0x10) cp_size = *data++;
+			if (cmd & 0x20) cp_size |= (*data++ << 8);
+			if (cp_size == 0) cp_size = 0x10000;
+			memcpy(out, src_buf + cp_off, cp_size);
+			out += cp_size;
+		} else {
+			memcpy(out, data, cmd);
+			out += cmd;
+			data += cmd;
+		}
+	}
+
+	/* sanity check */
+	if (data != top || out - dst != size) {
+		free(dst);
+		return NULL;
+	}
+
+	*dst_size = size;
+	return dst;
+}
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -8,6 +8,7 @@
  */
 #include <stdarg.h>
 #include "cache.h"
+#include "delta.h"
 
 const char *sha1_file_directory = NULL;
 
@@ -186,7 +187,8 @@ void * unpack_sha1_file(void *map, unsig
 	int ret, bytes;
 	z_stream stream;
 	char buffer[8192];
-	char *buf;
+	char *buf, *delta_ref;
+	unsigned long delta_ref_sz;
 
 	/* Get the data stream */
 	memset(&stream, 0, sizeof(stream));
@@ -201,8 +203,15 @@ void * unpack_sha1_file(void *map, unsig
 		return NULL;
 	if (sscanf(buffer, "%10s %lu", type, size) != 2)
 		return NULL;
-
 	bytes = strlen(buffer) + 1;
+
+	if (!strcmp(type, "delta")) {
+		delta_ref = read_sha1_file(buffer + bytes, type, &delta_ref_sz);
+		if (!delta_ref)
+			return NULL;
+	} else
+		delta_ref = NULL;
+
 	buf = xmalloc(*size);
 
 	memcpy(buf, buffer + bytes, stream.total_out - bytes);
@@ -214,6 +223,17 @@ void * unpack_sha1_file(void *map, unsig
 			/* nothing */;
 	}
 	inflateEnd(&stream);
+
+	if (delta_ref) {
+		char *newbuf;
+		unsigned long newsize;
+		newbuf = patch_delta(delta_ref, delta_ref_sz, buf+20, *size-20, &newsize);
+		free(delta_ref);
+		free(buf);
+		buf = newbuf;
+		*size = newsize;
+	}
+
 	return buf;
 }
 
Created: test-delta.c (mode:100644)
--- /dev/null
+++ b/test-delta.c
@@ -0,0 +1,79 @@
+/*
+ * test-delta.c: test code to exercise diff-delta.c and patch-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include "delta.h"
+
+static const char *usage =
+	"test-delta (-d|-p) <from_file> <data_file> <out_file>";
+
+int main(int argc, char *argv[])
+{
+	int fd;
+	struct stat st;
+	void *from_buf, *data_buf, *out_buf;
+	unsigned long from_size, data_size, out_size;
+
+	if (argc != 5 || (strcmp(argv[1], "-d") && strcmp(argv[1], "-p"))) {
+		fprintf(stderr, "Usage: %s\n", usage);
+		return 1;
+	}
+
+	fd = open(argv[2], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[2]);
+		return 1;
+	}
+	from_size = st.st_size;
+	from_buf = mmap(NULL, from_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (from_buf == MAP_FAILED) {
+		perror(argv[2]);
+		return 1;
+	}
+	close(fd);
+
+	fd = open(argv[3], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[3]);
+		return 1;
+	}
+	data_size = st.st_size;
+	data_buf = mmap(NULL, data_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (data_buf == MAP_FAILED) {
+		perror(argv[3]);
+		return 1;
+	}
+	close(fd);
+
+	if (argv[1][1] == 'd')
+		out_buf = diff_delta(from_buf, from_size,
+				     data_buf, data_size, &out_size);
+	else
+		out_buf = patch_delta(from_buf, from_size,
+				      data_buf, data_size, &out_size);
+	if (!out_buf) {
+		fprintf(stderr, "delta operation failed (returned NULL)\n");
+		return 1;
+	}
+
+	fd = open (argv[4], O_WRONLY|O_CREAT|O_TRUNC, 0666);
+	if (fd < 0 || write(fd, out_buf, out_size) != out_size) {
+		perror(argv[4]);
+		return 1;
+	}
+
+	return 0;
+}

^ permalink raw reply

* Re: RFC: adding xdelta compression to git
From: C. Scott Ananian @ 2005-05-03 15:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505022131380.3594@ppc970.osdl.org>

On Mon, 2 May 2005, Linus Torvalds wrote:

>> * Changes the repository format.
>
> It wouldn't necessarily. You should be able to do this with _zero_ changes
> to existing objects what-so-ever.

Yes.  The 'chunking' code I posted earlier does this, etc.  It's kinda odd 
computing a SHA-1 including the 'blob <size>\0' header, even when your 
representation doesn't use this type exactly, but it's no big deal.  I'm 
still tinkering with this, btw; I can get modest improvements in 'real' disk 
space used, but nothing earth-shattering (yet).  I'll post the list of 
things I tried and how well they worked at some point, just to save people 
the effort of retrying things.

I've been working from the 'no knowledge of commit structure needed' 
perspective; I think Chris Mason has been using the structure of the 
commit object to guide delta-fication and showing more impressive 
space savings.
  --scott

HTAUTOMAT Legion of Doom payment PBPRIME insurgent shortwave AVBUSY 
Nader PBCABOOSE overthrow explosion Ortega STANDEL ECJOB Sigint FBI
                          ( http://cscott.net/ )

^ permalink raw reply

* Re: questions about cg-update, cg-pull, and cg-clone.
From: Zack Brown @ 2005-05-03 15:22 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Git Mailing List
In-Reply-To: <20050502195846.GC20818@pasky.ji.cz>

On Mon, May 02, 2005 at 09:58:46PM +0200, Petr Baudis wrote:
> Dear diary, on Sat, Apr 30, 2005 at 02:53:22AM CEST, I got a letter
> where Zack Brown <zbrown@tumblerings.org> told me that...
> > 'cg-update branch-name' grabs any new changes from the upstream repository and
> > merges them into my local repository. If I've been editing files in my local
> > repository, the update attempts to merge the changes cleanly.
> 
> Yes.
> 
> > Now, if the update is clean, a cg-commit is invoked automatically, and if the
> > update is not clean, I then have to resolve any conflicts and give the cg-commit
> > command by hand. But: what is the significance of either of these cg-commit
> > commands? Why should I have to write a changelog entry recording this merge? All
> 
> You might want to write some special notes regarding the merge, e.g.
> when you want to describe some non-trivial conflict resolution, or even
> give a short blurb of the changes you are merging.
> 
> If you don't know what to say, just press Ctrl-D. The first line of the
> commit always says "Merge with what_you_are_merging_with".
> 
> > I'm doing is updating my tree to be current. Why should I have to 'commit' that
> > update?
> 
> If you are only updating your tree to be current, you don't have to
> commit, and in fact you don't commit (you do so-called "fast-forward
> merge", which will just update your HEAD pointer to point at the newer
> commit). You commit only when you were merging stuff (so-called "tree
> merge"; well, that's at least how I call it to differentiate it from the
> fast-forward merge). That means you have some local commits over there -
> I can't just update your tree to be current, sorry. That would lose your
> commit. I have to merge the changes into your tree through a merge
> commit.

Hm.

So, suppose I'm working on your Cogito HEAD. I make some changes to my local
tree and commit them to my tree, and then before I go forward, I want to grab
whatever you've done recently, to make sure we're not in conflict before I add
new changes. If I understand you right, this situation would be a 'fast forward
merge'. So what is the command I give to just 'merge' your HEAD with mine,
without requiring a changelog entry?

Alternatively, suppose I'm you, the project lead, and Zackdude has some
changes for me, based on my HEAD. I want to 'merge' his tree into mine. If
I'm still understanding you, this is a 'tree merge'. Now I give a cg-update,
and now I *want* to give a changelog entry to record the merge.  Correct?

No, I still don't see it. I don't see why I would want to add an additional
changelog entry on top of whatever changelog entries Zackdude has made himself.
It just seems to pollute the changelog with entries that are essentially
meaningless. When I read back over the logs, I'm not going to be interested in
the bookkeeping of when I merged with various developers, I'm going to be
interested in what those developers actually did to the code, and what *I*
actually did to the code.

> 
> > Now I look at 'cg-pull'. What does this do? The readme says something about
> > printing two ids, and being useful for diffs. But can't I do a diff after a
> > cg-update and get the same result? I'm very confused about cg-pull right now.
> 
> cg-pull does the first part of cg-update. It is concerned by fetching
> the stuff from the remote repository to the local one. cg-merge then
> does the second part, merging the stuff to your local tree (doing either
> fast-forward or tree merge).

OK, I don't understand this either. What is the difference between fetching the
stuff and merging the stuff? Suppose I am working on a local repo of Cogito
HEAD. I make some changes, commit them, and then I do a cg-pull. What happens?
Are my changes overwritten? Do they show up at all? Do they exist in some
nebulous ether that I will never see until I do a merge?

Be well,
Zack

> 
> -- 
> 				Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Zack Brown

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Linus Torvalds @ 2005-05-03 15:07 UTC (permalink / raw)
  To: Chris Mason; +Cc: Nicolas Pitre, Alon Ziv, git
In-Reply-To: <200505030724.57827.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:
> 
> On the full import of all the bk->cvs changesets, the average file size 
> in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.

Have you checked how many of those are blobs?

For many commits, we generate as many (or more) _tree_ objects as we 
generate blobs. 

And tree obejcts from the same "supertree" really is something that I
wouldn't mind packing some way, because they really tend to be very much
related (since they refer to each other). Eg the commit and the top-level
tree are almost always a pair, since you'd get a shared top-level tree
only with two commits that have the exact same content (which definitely
happens, don't get me wrong, but it we get some duplication for that case,
we'd still be winning).

		Linus

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 15:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505031037.38005.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> On Tuesday 03 May 2005 10:24, Nicolas Pitre wrote:
> > On Tue, 3 May 2005, Chris Mason wrote:
> > > Hmmm, something is strange here, am I using this wrong?
> > >
> > > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> > > Aborted
> >
> > Can you send me your foo and delta2 files?
> >
> Sorry, thought I had the whole command history in there.  I went for something 
> small to start ;)
> 
> coffee:~/git/linus.orig # echo foo > foo
> coffee:~/git/linus.orig # echo foo2 > foo2
> coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> coffee:~/git/linus.orig # ls -la delta1
> -rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
> coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***

OK, doh!

--- diff-delta.c.orig	2005-05-03 11:00:39.900529634 -0400
+++ diff-delta.c	2005-05-03 11:01:03.210031176 -0400
@@ -307,7 +307,7 @@
 	}
 
 	if (inscnt)
-		out[-inscnt - 1] = inscnt;
+		out[outpos - inscnt - 1] = inscnt;
 
 	delta_cleanup(&bdf);
 	*delta_size = outpos;

^ permalink raw reply

* Re: More problems...
From: Andreas Gal @ 2005-05-03 15:00 UTC (permalink / raw)
  To: Petr Baudis
  Cc: Linus Torvalds, Anton Altaparmakov, Russell King, Junio C Hamano,
	Ryan Anderson, git
In-Reply-To: <20050503014816.GQ20818@pasky.ji.cz>


I am just soft-linking objects/ in the branched tree. I can live with 
dangling objects, branching is extremly fast, and diskspace is cheap 
anyway. The only downside is that it doesn't work too well with rsync as 
network protocol, but I use only http-pull and rpush anyway.

Andreas

On Tue, 3 May 2005, Petr Baudis wrote:

> Dear diary, on Tue, May 03, 2005 at 12:19:16AM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> told me that...
> > But for "normal" situations, where you have a tree or two, the hardlinking 
> > win might not be big enough to warrant the maintenance headache. With 
> > hardlinking, you _do_ need to "trust" the other trees to some degree.
> 
> As long as the trees aren't yours and you aren't doing something really
> horrible with them...
> 
> $ time git-local-pull -a -l $(cat ~/git-devel/.git/HEAD) ~/git-devel/.git/
> real    0m0.332s
> 
> $ time git-local-pull -a $(cat ~/git-devel/.git/HEAD) ~/git-devel/.git/
> real    0m4.306s
> 
> And this is only 13M Cogito objects database. I think one of the
> important things is to encourage branching, therefore it must be fast
> enough; that's why I really wanted to do hardlinks. The disk space is
> important, but the speed hit probably equally (if not more) so.
> 
> BTW, the object database files should have 0444 or such; they really
> _are_ read-only and making them so mode-wise could help against some
> mistakes too.
> 
> It's clear that Cogito should have a way to choose whether to hardlink
> or copy; the question is which one should be the default one and how
> should it be specified.  I thought about using file:// vs. just local
> path to differentiate between copy and hardlinking, but that'd be
> totally non-obvious, therefore bad UI-wise.
> 
> BTW, I've just committed support for pulling from remote repositories
> over the HTTP and SSH protocols (http://your.git/repo,
> git+ssh://root@git.nasa.gov/srv/git/mars) (note that I was unable to
> test the SSH stuff properly now; success reports or patches welcome).
> Also, the local hardlinking access is now done over git-local-pull,
> therefore the cp errors should go away now.
> 
> I'm not yet decided whether locations like
> 
> 	kernel.org:/pub/scm/cogito/cogito.git
> 
> should invoke rsync, rpull, throw an error or print a fortune cookie.
> 
> -- 
> 				Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Linus Torvalds @ 2005-05-03 14:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Alon Ziv, Git Mailing List
In-Reply-To: <Pine.LNX.4.62.0505030344170.14033@localhost.localdomain>

On Tue, 3 May 2005, Nicolas Pitre wrote:
> 
> Yep, that's what I've done last weekend (and just made it actually 
> work since people are getting interested).

I have to say that it looks uncommonly simple. Also, afaik, this should
still work with the current fsck, it's just that because fsck doesn't
understand the linkages, the error reporting won't be as good as it could
be (I'd _much_ rather see "delta failed in object xxxxx" than "unable to
read xxxxxx").

Now, one thing I like about this approach is that the actual delta 
_generation_ can be done off-line, and independently of anything else. 
Which means that the performance paths I care about (commit etc) are 
largely unaffected, and you can "deltify" a git archive overnight or 
something. 

In fact, it means that you might even be able to use some fairly expensive 
"search for the best blob object to delta against", including very much a 
intelligent rename search (ie "oh, this is a new object, let's see if any 
of the old deleted objects generate a good delta"), but you might even go 
back more than one generation.

Hmm. How nasty are those scripts?

		Linus

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 14:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505031022340.14033@localhost.localdomain>

On Tuesday 03 May 2005 10:24, Nicolas Pitre wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > Hmmm, something is strange here, am I using this wrong?
> >
> > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> > Aborted
>
> Can you send me your foo and delta2 files?
>
Sorry, thought I had the whole command history in there.  I went for something 
small to start ;)

coffee:~/git/linus.orig # echo foo > foo
coffee:~/git/linus.orig # echo foo2 > foo2
coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
coffee:~/git/linus.orig # ls -la delta1
-rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
*** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***

-chris

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 14:24 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505031013.57476.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> > On Mon, 2 May 2005, Linus Torvalds wrote:
> > > If you do something like this, you want such a delta-blob to be named by
> > > the sha1 of the result, so that things that refer to it can transparently
> > > see either the original blob _or_ the "deltified" one, and will never
> > > care.
> >
> > Yep, that's what I've done last weekend (and just made it actually
> > work since people are getting interested).
> 
> Hmmm, something is strange here, am I using this wrong?
> 
> coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> Aborted

Can you send me your foo and delta2 files?


Nicolas

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 14:13 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505030344170.14033@localhost.localdomain>

On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> On Mon, 2 May 2005, Linus Torvalds wrote:
> > If you do something like this, you want such a delta-blob to be named by
> > the sha1 of the result, so that things that refer to it can transparently
> > see either the original blob _or_ the "deltified" one, and will never
> > care.
>
> Yep, that's what I've done last weekend (and just made it actually
> work since people are getting interested).

Hmmm, something is strange here, am I using this wrong?

coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
*** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
Aborted

Valgrind output:

==9634== Invalid read of size 1
==9634==    at 0x1B9036F0: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906F is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036F3: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08D is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B9036F6: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906E is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036F9: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08C is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B9036FC: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906D is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036FF: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08B is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B903702: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906C is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B903708: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08A is not stack'd, malloc'd or (recently) free'd
delta operation failed (returned NULL)
==9634==
==9634== ERROR SUMMARY: 206 errors from 13 contexts (suppressed: 0 from 0)
==9634== malloc/free: in use at exit: 0 bytes in 0 blocks.
==9634== malloc/free: 1 allocs, 1 frees, 5 bytes allocated.
==9634== For a detailed leak analysis,  rerun with: --leak-check=yes
==9634== For counts of detected errors, rerun with: -v

-chris

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 12:51 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505030724.57827.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> This looks much nicer than using zdelta, I'll try switching my packed item to 
> your delta generator later this week.  Some quick and dirty space numbers to 
> show why we need to pack the files together:
> 
> On the full import of all the bk->cvs changesets, the average file size 
> in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.
> 
> This means that of the 2.5GB the .git directory consumes, about 1GB is taken 
> up by files under 4k where deltas won't save space.  If the remaining files 
> could be delta compressed down to less than 4k, they would still take up 
> around 400MB on disk.

Sure.  However it helps for history backups and network transfer.

However if the delta compression and packed storage can remain as 
decoupled as possible from each other this is good for flexibility.


Nicolas

^ permalink raw reply

* Re: RFC: adding xdelta compression to git
From: Dan Holmsand @ 2005-05-03 12:48 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.58.0505022131380.3594@ppc970.osdl.org>

Linus Torvalds wrote:
> Also, the fact is, since git saves things as separate files, you'd not win
> as much as you would with some other backing store. So the second step is
> to start packing the objects etc. I think there is actually a very steep
> complexity edge here - not because any of the individual steps necessarily
> add a whole lot, but because they all lead to the "next step".

Actually, you can win quite a lot.

I've just been playing with storing the entire 
linux-2.4.0-to-2.6.12-rc2-patchset as xdelta patches in git. The entire 
thing ended up being 577M (instead of some 3.5G), according to du -sh 
--apparent-size. Considering that that's some 800M of patches, that's 
not too bad.

I used a very simple scheme: I stored a delta to the previous version of 
every file if that delta was less than 20% in size of the new file 
(otherwise, the whole file was stored as usual). If the previous version 
was already in delta form, the delta was computed from that versions 
"parent". I didn't even care to look at files less than 4k in size.

In other words, I didn't have to use any delta "chains", and still got 
quite massive storage size gains. And this scheme could easily be used 
on the fly; I'm guessing that it would be performance neutral (or even a 
slight gain, since less has to be compressed and written to disk on 
commits, and there might be less to read when diffing, since two 
"delta-blobs" might share the same parent).

Or, with careful tuning, a repo might be "deltaified" later on (assuming 
that delta blobs are addressed using the expanded files' hash).

/dan

^ permalink raw reply

* 'read-tree -m head' vs 'read-tree head'
From: Thomas Glanzmann @ 2005-05-03 12:49 UTC (permalink / raw)
  To: GIT

Hello,
I see in Linus merge script

	read-tree -m $merge_tree && checkout-cache -f -a && update-cache --refresh

Does this the same as read-tree $merge_tree would do?

	Thomas

^ permalink raw reply

* Re: gitweb on kernel.org lies?
From: David Woodhouse @ 2005-05-03 11:43 UTC (permalink / raw)
  To: Kay Sievers; +Cc: git
In-Reply-To: <1115117537.21105.53.camel@localhost.localdomain>

On Tue, 2005-05-03 at 12:52 +0200, Kay Sievers wrote:
> 
> The HEAD:
>   ftp://kernel.org/pub/scm/linux/kernel/git/dwmw2/audit-2.6.git/HEAD
> 
> points to:
>   42d4dc3f4e1ec1396371aac89d0dccfdd977191b
>   authorBenjamin Herrenschmidt <benh@kernel.crashing.org>
>   Fri, 29 Apr 2005 14:40:12 +0000 (07:40 -0700)
> 
> which is 3 days old, right?

Er, yeah... the active one is actually 'audit-2.6' not 'audit-2.6.git'.
Mea culpa -- I'd forgotten the latter was there, but it's the only one
that gitweb shows. 

-- 
dwmw2

^ permalink raw reply

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 11:24 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505030344170.14033@localhost.localdomain>

On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> On Mon, 2 May 2005, Linus Torvalds wrote:
> > If you do something like this, you want such a delta-blob to be named by
> > the sha1 of the result, so that things that refer to it can transparently
> > see either the original blob _or_ the "deltified" one, and will never
> > care.
>
> Yep, that's what I've done last weekend (and just made it actually
> work since people are getting interested).
>
This looks much nicer than using zdelta, I'll try switching my packed item to 
your delta generator later this week.  Some quick and dirty space numbers to 
show why we need to pack the files together:

On the full import of all the bk->cvs changesets, the average file size 
in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.

This means that of the 2.5GB the .git directory consumes, about 1GB is taken 
up by files under 4k where deltas won't save space.  If the remaining files 
could be delta compressed down to less than 4k, they would still take up 
around 400MB on disk.

-chris


^ permalink raw reply

* Re: gitweb on kernel.org lies?
From: Kay Sievers @ 2005-05-03 10:52 UTC (permalink / raw)
  To: David Woodhouse; +Cc: git
In-Reply-To: <1115109604.8508.34.camel@localhost.localdomain>

On Tue, 2005-05-03 at 09:40 +0100, David Woodhouse wrote:
> http://www.kernel.org/git/gitweb.cgi?p=linux%2Fkernel%2Fgit%2Fdwmw2%2Faudit-2.6.git;a=log
> doesn't seem to show the commits which were just put there. Why?

The HEAD:
  ftp://kernel.org/pub/scm/linux/kernel/git/dwmw2/audit-2.6.git/HEAD

points to:
  42d4dc3f4e1ec1396371aac89d0dccfdd977191b
  authorBenjamin Herrenschmidt <benh@kernel.crashing.org>
  Fri, 29 Apr 2005 14:40:12 +0000 (07:40 -0700)

which is 3 days old, right?

Kay

^ permalink raw reply

* Re: cogito "origin" vs. HEAD
From: Petr Baudis @ 2005-05-03  9:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Git Mailing List
In-Reply-To: <1115104408.6156.100.camel@gaston>

Dear diary, on Tue, May 03, 2005 at 09:13:28AM CEST, I got a letter
where Benjamin Herrenschmidt <benh@kernel.crashing.org> told me that...
> > when accessing the remote repository, Cogito always looks for remote
> > refs/heads/master first - if that one isn't there, it takes HEAD, but
> > there is no correlation between the local and remote branch name. If you
> > want to fetch a different branch from the remote repository, use the
> > fragment identifier (see cg-help cg-branch-add).
> 
> Ok, that I'm getting. So then, what happen of my local
> refs/heads/<branchname> and refs/heads/master/ ? I'm still a bit
> confused by the whole branch mecanism... It's my understanding than when
> I cg-init, it creates both "master" (a head without matching branch)
> and "origin" (a branch  + a head) both having the same sha1. It also
> checks out the tree.
> 
> Now, when I cg-update origin, what happens exactly ? I mean, I know it's
> pulls all objects, then get the master from the remote pointed by the
> origin branch, but then, I suppose it updates both my local "origin" and
> my local "master" pointer, right ? I mean, they are always in sync ? Or
> is this related to what branch my current checkout is tracking ?

They are in sync as long as you update only from that given branch.
At the moment you do a local commit, they get out of sync, at least
until your master branch is merged to the origin branch on the other
side. Every cg-update will then generate a merging commit, so it will
look like this:

     [origin]    [master]
            commit1
              |
            commit2               Both heads are in sync so far...
              |
            commit3
             /    \
            /     commit4         Now heads/master is commit4, but
           /        |             heads/origin is still commit3
          /         |
      commit5-.     |             heads/master:commit4, heads/origin:commit5
          |    \    |
          |     `-commit6         commit6 merges origin to master
          |       /
          |     /
          |   /
      commit6                     origin merged your master; since it
                                  contained all the commits on the origin
               |                  branch, it just took over the commit6
             commit6              commit pointer as its new head; so both
                                  heads are again in sync now

This is the reason why there are always at least two branches, origin
and master. The checked out tree is always of the master branch (unless
you do cg-seek, which is somewhat special anyway). [*] "Normally", when
you do no local changes and just always cg-update the origin branch, the
two branches are always in sync. At the point you start to "mix" several
remote branches besides origin in your tree, or at the point you do a
local commit, the master branch gets standalone - until the origin
merges your changes as drawn in the diagram.

There is one other situation when the head pointers may not be in sync -
when you do cg-pull instead of cg-update. You want to see what are the
changes in the origin branch, but you are not sure if you want them to
appear in your master branch, you do cg-pull origin. Your origin head
pointer is updated, but your master pointer stays where it is. If you
decide it's ok to bring the changes in, you do either cg-update, or only
cg-merge to avoid re-pulling.

[*] Technically, you can have multiple local branches and your tree can
be based on any of them, not only "master". Cogito supports that
internally, but (deliberately) provides no UI to set that up, at least
until we devise a way to do it without confusing people even more.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox