Git development

* Re: RFC: adding xdelta compression to git
From: Davide Libenzi @ 2005-05-03 18:10 UTC
  To: Linus Torvalds; +Cc: C. Scott Ananian, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505031031240.3594@ppc970.osdl.org>

On Tue, 3 May 2005, Linus Torvalds wrote:

> On Tue, 3 May 2005, C. Scott Ananian wrote:
> > 
> > Linus knows this.  His point is just to be sure you actually *code* that 
> > walk in fsck, and (hopefully) do so w/o complicating the fsck too much.
> 
> Indeed. It's also a performance issue.
> 
> If you do xdelta objects, and don't tell fsck about it, then fsck will 
> just check every object as a blob. Why is that bad?
> 
> Think about it: let's say that you have a series of xdelta objects, and a 
> fsck that is xdelta-unaware. It will unpack each object independently, 
> which means that it will keep on doing the same early xdelta work over and 
> over and over again. Instead of just applying them in order, and checking 
> the sha1 of the result at each point.
> 
> Now, you probably want to limit the length of the chains to some fairly
> small number anyway, so maybe that's not a big deal. Who knows. And I'm 
> actually still so anal that I don't think I'd use this for _my_ tree, just 
> because I'm a worry-wart (and I still think disk is incredibly cheap ;)

If you use a "full tip" metadata format with reverse deltas, you drop a
"full" version from time to time along the chain, and you keep a small index
file, you have:

1) No matter how big the xdelta collection object becomes, you are only
   touching very limited regions of it (due to the small index file, which
   can be less than 20+8 bytes per entry in the xdelta blob)

2) Checkout happens w/out even doing xpatching (since the tip is full)

3) Checkins require only one xdelta operation (since the tip is full),
   and zero when it is time to store a full version along the chain (I
   usually drop one every 10-16 xdeltas, depending on the progressive size
   of the delta operations)

4) Worst-case performance in reconstructing histories is bounded by the
   longest xdelta chain (10-16)
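
The four properties above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the libxdiff format:
`ReverseDeltaStore`, `make_delta`, `apply_delta` and `MAX_CHAIN` are
hypothetical names, the delta is a toy copy/insert list built with `difflib`,
the small index file is replaced by an in-memory list, and the chain limit is
set to 4 to keep the example small (the text above mentions 10-16).

```python
import difflib

MAX_CHAIN = 4  # hypothetical limit, chosen small for the example


def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as copy/insert operations against `base`."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes of base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal new bytes
        # "delete" needs no operation: those base bytes are just not copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from `base` and the operations of make_delta()."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


class ReverseDeltaStore:
    """Versions are kept oldest first; the last entry (the tip) is always full."""

    def __init__(self):
        self.entries = []  # ("full", data) or ("delta", ops against the next version)
        self.run = 0       # deltas stored since the last full version

    def checkin(self, data: bytes) -> None:
        if self.entries:
            old_tip = self.entries[-1][1]
            if self.run < MAX_CHAIN:
                # One delta operation: the old tip becomes a reverse delta.
                self.entries[-1] = ("delta", make_delta(data, old_tip))
                self.run += 1
            else:
                # Leave a full version along the chain: zero delta operations.
                self.run = 0
        self.entries.append(("full", data))

    def checkout(self, index: int = -1):
        """Return (data, number of deltas applied) for version `index`."""
        index %= len(self.entries)
        full = index
        while self.entries[full][0] != "full":
            full += 1  # walk toward the tip until a full version is found
        data = self.entries[full][1]
        for i in range(full - 1, index - 1, -1):
            data = apply_delta(data, self.entries[i][1])
        return data, full - index
```

Checking out the tip applies no delta at all, and any older version applies
at most `MAX_CHAIN` of them, which is the bound stated in point 4.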

In some way I tend to agree (strangely ;) with you about the disk-cheap
mantra, but network bandwidth matters IMO. So, if you do not want (being a
real worry-wart) to leverage xdelta on the FS trees, you can have way
smarter network protocols using xdelta plus the knowledge of the git
history structure. The rsync algo uses xdelta, but the poor guy is not
able to leverage the knowledge of the history that only git has.
So, if Larry and Greg share a common object A, Larry changes A and makes
a new git object B, rsync will transfer the whole object B, because it 
does not have any idea of the git structure. Git though, has this 
knowledge, and it can say to the remote fetcher: Look, I have this new 
thing called B, that is basically your thing A plus this very small xdelta 
(B-A). And typical xdelta diffs are really small (1/7 to 1/10 of classical 
'diff -u' ones).

- Davide

* [PATCH] cogito: Updated cg-status -a
From: Matt Porter @ 2005-05-03 17:47 UTC
  To: Petr Baudis; +Cc: git

Updated patch versus latest cogito and bug fix for a thinko.
If -a is passed, the same output is generated, but it also shows
all modified but uncommitted files.

Signed-off-by: Matt Porter <mporter@kernel.crashing.org>

--- aa6233be6d1b8bf42797c409a7c23b50593afc99/cg-status  (mode:100755 sha1:9e7f0e59284a3d15cda35bbd5579c44d8eda05d5)
+++ ee35a6204e59cf47966080be20d8248a6e4aa3c3/cg-status  (mode:100755 sha1:dc821a1255f012a612aa4d25ffc551c32b017bd9)
@@ -3,7 +3,9 @@
 # Show status of entries in your working tree.
 # Copyright (c) Petr Baudis, 2005
 #
-# Takes no arguments.
+# Takes an optional -a argument which will cause all repository status
+# to be shown, including modified but uncommitted files
+
 
 . cg-Xlib
 
@@ -20,3 +22,16 @@
 	shift
 done
 ' padding
+
+if [ "$1" = "-a" ]; then
+	{
+		git-update-cache --refresh
+	} | cut -f 1 -d ":" | xargs sh -c '
+	while [ "$1" ]; do
+		tag="M";
+		filename=${1%: *};
+		echo "$tag $filename";
+		shift
+	done
+	' padding
+fi

* Re: RFC: adding xdelta compression to git
From: Linus Torvalds @ 2005-05-03 17:35 UTC
  To: C. Scott Ananian; +Cc: Davide Libenzi, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.61.0505031151380.32767@cag.csail.mit.edu>

On Tue, 3 May 2005, C. Scott Ananian wrote:
> 
> Linus knows this.  His point is just to be sure you actually *code* that 
> walk in fsck, and (hopefully) do so w/o complicating the fsck too much.

Indeed. It's also a performance issue.

If you do xdelta objects, and don't tell fsck about it, then fsck will 
just check every object as a blob. Why is that bad?

Think about it: let's say that you have a series of xdelta objects, and a 
fsck that is xdelta-unaware. It will unpack each object independently, 
which means that it will keep on doing the same early xdelta work over and 
over and over again. Instead of just applying them in order, and checking 
the sha1 of the result at each point.
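
To put rough numbers on the repeated work, here is a small Python sketch. It
is an illustration only: the delta format (keep a prefix of the base, then
append bytes) is a toy assumption, the function names are hypothetical, and
the SHA-1 is taken over the raw contents rather than over a real git object.

```python
import hashlib


def apply_delta(base: bytes, delta) -> bytes:
    """Toy delta (an assumption): keep a prefix of the base, then append bytes."""
    keep, suffix = delta
    return base[:keep] + suffix


def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def fsck_unaware(root, deltas, expected):
    """Unpack every object independently, restarting from the root each time."""
    applied = 0
    for i, want in enumerate(expected):
        data = root
        for delta in deltas[:i + 1]:
            data = apply_delta(data, delta)
            applied += 1
        if sha1(data) != want:
            raise ValueError(f"unable to read object {i}")
    return applied


def fsck_chain_aware(root, deltas, expected):
    """Apply the deltas in order, checking the sha1 of the result at each point."""
    applied = 0
    data = root
    for i, (delta, want) in enumerate(zip(deltas, expected)):
        data = apply_delta(data, delta)
        applied += 1
        if sha1(data) != want:
            raise ValueError(f"delta failed in object {i}")
    return applied


# A chain of five deltas, each appending one 3-byte line to the previous object.
root = b"v0\n"
deltas = [(3 * (i + 1), b"v%d\n" % (i + 1)) for i in range(5)]

expected = []
data = root
for delta in deltas:
    data = apply_delta(data, delta)
    expected.append(sha1(data))

print(fsck_unaware(root, deltas, expected))      # 15 delta applications
print(fsck_chain_aware(root, deltas, expected))  # 5 delta applications
```

For a chain of n deltas the unaware check does n(n+1)/2 applications and the
chain-aware one does n, so short chains keep the difference small.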

Now, you probably want to limit the length of the chains to some fairly
small number anyway, so maybe that's not a big deal. Who knows. And I'm 
actually still so anal that I don't think I'd use this for _my_ tree, just 
because I'm a worry-wart (and I still think disk is incredibly cheap ;)

		Linus

* Re: questions about cg-update, cg-pull, and cg-clone.
From: Joel Becker @ 2005-05-03 17:20 UTC
  To: Zack Brown; +Cc: Petr Baudis, Git Mailing List
In-Reply-To: <20050503155915.GV4747@ca-server1.us.oracle.com>

On Tue, May 03, 2005 at 08:59:15AM -0700, Joel Becker wrote:
> 	Then you change the first file, adding a few functions.  You
> commit it, and it now has the hash 111111.  This change means the tree
> hash becomes 222222.  So, HEAD contains 222222.
> 	You then update from Petr again.  He's changed the second file.
> Its hash is no longer cccccc, it's eeeeee.  In his tree, the hash of
> the tree is 333333 (from file 1's aaaaaa and file 2's eeeeee).  But the
> hash of your tree is 444444 (from your local file 1's 111111 and file
> 2's eeeeee).  So, the hash of your tree becomes 444444.  Your HEAD
> contains 444444.
> This does _not_ match his 333333 HEAD.  You are committing the
> combination of his change and yours.  He is saying that this work, which
> may have required hand-merging or commit resolution, is "interesting"
> information.

	Actually, it is more than interesting.  The tree has gone from a
HEAD of 222222 to a HEAD of 444444.  When HEAD changes, you need a
commit to describe the path.  Otherwise, you have a breakdown in the
history.  cg-log (or any other command) would have no way to get back
from 444444 to 222222 (or Petr's 333333) without the commit object
specifying its parent(s).
	If you have made no commits on your side, then the old HEAD is
Petr's old HEAD, the new HEAD is Petr's new 333333, and he's already
created a commit object describing this.  You're just fast-forwarding.

Joel

-- 

"The nice thing about egotists is that they don't talk about other
 people."
         - Lucille S. Harper

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Rene Scharfe @ 2005-05-03 17:14 UTC
  To: Bill Davidsen
  Cc: Matt Mackall,
	Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>,
	Linus Torvalds, Ryan Anderson, Andrea Arcangeli, linux-kernel,
	git
In-Reply-To: <4277A52E.1020601@tmr.com>

Bill Davidsen schrieb:
> On the theory that my first post got lost, why use /usr/bin/env at 
> all, when bash already does that substitution? To support people who 
> use other shells?
> 
> ie.: FOO=xx perl -e '$a=$ENV{FOO}; print "$a\n"'

/usr/bin/env is used in scripts in the shebang line (the very first line
of the script, starting with "#!", which denotes the interpreter to use
for that script) to make a PATH search for the real interpreter.
Some folks keep their python (or Perl, or Bash etc.) in /usr/local/bin
or in $HOME, that's why this construct is needed at all.

Changing environment variables is not the goal, insofar as this usage
exploits only a side-effect of env.  It is portable in practice because
env is in /usr/bin on most modern systems.

So you could replace this first line of a Python script:

   #!/usr/bin/env python

with this:

   #!python

except that the latter doesn't work because you need to specify an
absolute path there. :]
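
For illustration, the PATH search that env performs on behalf of the shebang
line looks roughly like the Python sketch below. `find_in_path` is a
hypothetical helper written for this example; the standard library already
offers the same lookup as `shutil.which`.

```python
import os


def find_in_path(name, path=None):
    """Return the first executable called `name` found in `path`, or None."""
    if path is None:
        path = os.environ.get("PATH", os.defpath)
    for directory in path.split(os.pathsep):
        # An empty PATH component traditionally means the current directory.
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Prints whichever python the current PATH selects, or None if there is none.
    print(find_in_path("python"))
```

With "#!/usr/bin/env python" the kernel only needs the absolute path of env;
the lookup of python itself then follows the user's PATH, so a copy in
/usr/local/bin or in $HOME is found without editing the script.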

Rene

* [PATCH] cogito: Add cg-undo command
From: Matt Porter @ 2005-05-03 17:06 UTC
  To: Petr Baudis; +Cc: git

Adds a cg-undo command which takes a commit ID and resets HEAD
to the parent of the commit ID...refreshing the tree. This undoes
a single commit or a series of commits.

Signed-off-by: Matt Porter <mporter@kernel.crashing.org>

--- a1aff2a6748c0c0d08058c7d74503e724abc5d03/Makefile  (mode:100644 sha1:6ae0afa0208a8f755d383281a6d049a4ef90fe63)
+++ 023d9a7929d2f933d8e008f1679f13a58f7b1229/Makefile  (mode:100644 sha1:6c282aeebe86ecee9e634481b3d51fd53a582791)
@@ -47,7 +47,7 @@
 	cg-add cg-admin-lsobj cg-cancel cg-clone cg-commit cg-diff \
 	cg-export cg-help cg-init cg-log cg-ls cg-merge cg-mkpatch \
 	cg-patch cg-pull cg-branch-add cg-branch-ls cg-rm cg-seek cg-status \
-	cg-tag cg-tag-ls cg-update cg-Xlib
+	cg-tag cg-tag-ls cg-undo cg-update cg-Xlib
 
 COMMON=	read-cache.o
 
Index: cg-help
===================================================================
--- a1aff2a6748c0c0d08058c7d74503e724abc5d03/cg-help  (mode:100755 sha1:1f5d2d79b67490d44ce0f575ff9a4b80134ea47f)
+++ 023d9a7929d2f933d8e008f1679f13a58f7b1229/cg-help  (mode:100755 sha1:c7dc8f3e03895374cd0dae544570a37a459c2466)
@@ -43,6 +43,7 @@
 	cg-status
 	cg-tag		TNAME [COMMIT_ID]
 	cg-tag-ls
+	cg-undo		[COMMIT_ID]
 	cg-update	[BNAME]
 	cg-version
 
Index: cg-undo
===================================================================
--- /dev/null  (tree:a1aff2a6748c0c0d08058c7d74503e724abc5d03)
+++ 023d9a7929d2f933d8e008f1679f13a58f7b1229/cg-undo  (mode:100755 sha1:7fd6d89158fb5aeee42aa05a93f2c81884d9bd34)
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+#
+# Undo a commit or a series of commits
+# Copyright (C) Matt Porter, 2005
+#
+# Takes a commit ID which is the earliest commit to be
+# removed from the repository.
+
+. cg-Xlib
+
+PARENT=`git-cat-file commit $1 | grep parent | cut -f 2 -d " "`
+echo "Undo from $1 to current HEAD"
+echo "Reset HEAD to $PARENT"
+echo "$PARENT" > .git/HEAD
+git-read-tree -m "$PARENT" || {
+	echo >&2 "$PARENT: bad commit"
+	exit 1
+}
+git-checkout-cache -f -a
+git-update-cache --refresh

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 16:54 UTC
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505031104080.14033@localhost.localdomain>

On Tuesday 03 May 2005 11:04, Nicolas Pitre wrote:
> On Tue, 3 May 2005, Chris Mason wrote:

> > coffee:~/git/linus.orig # echo foo > foo
> > coffee:~/git/linus.orig # echo foo2 > foo2
> > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > coffee:~/git/linus.orig # ls -la delta1
> > -rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
> > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
>
> OK, doh!

Thanks, this one works ;)  I'll kick off a run with this replacing zdelta, 
should be around 3 hours.  For my small tree run with 300 patches, it's faster
than zdelta with about the same space savings.

-chris

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 16:35 UTC
  To: C. Scott Ananian; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.61.0505031153550.32767@cag.csail.mit.edu>

On Tuesday 03 May 2005 11:57, C. Scott Ananian wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > your delta generator later this week.  Some quick and dirty space numbers
> > to show why we need to pack the files together:
>
> Are you accurately accounting for the cost of the extra hard/soft links
> your scheme requires?  Ie the directories get larger, lookups take
> slightly longer, etc.  Also access to a given file takes longer, and the
> deltas are referring to *other* packed files which *also* take longer to
> decompress and access...

My patch doesn't create any extra directory entries because the file for the 
packed file is unlinked after all the hard links are made.  Even if I kept 
the packed file directory entry, I'd be adding one directory entry and saving an
average 6-7 inodes per commit.

>
> How much better does delta-fication do, compared to just packing?

The best case for just packing is to pack the blobs, trees and commits all 
into one object.  Doing all three brought the tree down from 2.5GB to 1.57GB.  

The delta patch does pack trees together, but not into the same file as the 
blobs, and commits are not packed at all.  This is just because it is a pain 
to carry those changes around; it'll be easy to do later.

With the delta patch, the tree is around 900MB, I estimate packing the commits 
and trees into the blob files would save another 200MB.

Because space savings is so tightly coupled with packing ratios, a script to 
repack blobs, trees and commits from multiple commits will give much better 
compression.  Right now the patch does not delta trees or commits, but it
might make sense to delta the trees via the repacking script.

-chris

* Re: questions about cg-update, cg-pull, and cg-clone.
From: Daniel Barkalow @ 2005-05-03 16:30 UTC
  To: Zack Brown; +Cc: Petr Baudis, Git Mailing List
In-Reply-To: <20050503152214.GA1704@tumblerings.org>

On Tue, 3 May 2005, Zack Brown wrote:

> So, suppose I'm working on your Cogito HEAD. I make some changes to my local
> tree and commit them to my tree, and then before I go forward, I want to grab
> whatever you've done recently, to make sure we're not in conflict before I add
> new changes. If I understand you right, this situation would be a 'fast forward
> merge'. So what is the command I give to just 'merge' your HEAD with mine,
> without requiring a changelog entry?

In this case, you have to do a tree merge, because you have some commits
and he has some commits, and you want to be in a state where you have your
commits and his; this state is new, so you need a new commit with both
lines as parents.

> Alternatively, suppose I'm you, the project lead, and Zackdude has some
> changes for me, based on my HEAD. I want to 'merge' his tree into mine. If
> I'm still understanding you, this is a 'tree merge'. Now I give a cg-update,
> and now I *want* to give a changelog entry to record the merge.  Correct?

In this case, you don't have any commits that the other guy doesn't
have. Zackdude took your tree, made some changes, and that's his
head. Your head is still the same. He's already specified what happens
when you go from your head to his head; that's what he did, so the answer
has to be his head. That's a fast-forward.

Now, if the project lead decided to update from a second contributor who
hadn't rebased their contribution on the new head, then a merge is
required, to resolve the potential conflicts, and this merge needs a
commit.
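
The distinction can be sketched in Python as a reachability test over the
commit graph. This is a simplified illustration, not how Cogito implements
it: `ancestors` and `classify_update` are hypothetical names, and the
history is a plain dict mapping each commit to its list of parents.

```python
def ancestors(commits, head):
    """All commits reachable from `head` through parent links, `head` included."""
    seen, stack = set(), [head]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(commits[commit])
    return seen


def classify_update(commits, mine, theirs):
    """Say what updating from `theirs` means for a tree whose head is `mine`."""
    if theirs in ancestors(commits, mine):
        return "up-to-date"     # they have nothing I lack
    if mine in ancestors(commits, theirs):
        return "fast-forward"   # I have nothing they lack: just take their head
    return "merge"              # both sides have commits: a new commit is needed


# The project lead's head is "lead"; Zackdude and a second contributor both
# started from it and each committed once.
commits = {
    "lead": [],
    "zack": ["lead"],
    "second": ["lead"],
}

print(classify_update(commits, "lead", "zack"))    # fast-forward
print(classify_update(commits, "zack", "second"))  # merge
```

After the merge, the new commit would list both "zack" and "second" as
parents, which is what keeps both histories reachable from the new head.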

> No, I still don't see it. I don't see why I would want to add an additional
> changelog entry on top of whatever changelog entries Zackdude has made himself.
> It just seems to pollute the changelog with entries that are essentially
> meaningless. When I read back over the logs, I'm not going to be interested in
> the bookkeeping of when I merged with various developers, I'm going to be
> interested in what those developers actually did to the code, and what *I*
> actually did to the code.

If developer A's changes work, and developer B's changes work, but they
don't work in your merge of them, you'll want to see that. Furthermore,
without a commit with both of their commits as parents, you can't reach
both of their histories from anywhere.

> OK, I don't understand this either. What is the difference between fetching the
> stuff and merging the stuff? Suppose I am working on a local repo of Cogito
> HEAD. I make some changes, commit them, and then I do a cg-pull. What happens?
> Are my changes overwritten? Do they show up at all? Do they exist in some
> nebulous ether that I will never see until I do a merge?

If you do a "cg-pull pasky", this doesn't change any of your stuff, but it
means that "cg-diff -r pasky" will now compare against his new head,
rather than the head he had when you previously did stuff. "cg-log
pasky" will include the new messages, and so forth. Also, you can then do
the merge without a network connection; you can pull overnight and merge
on the train.

You don't see anything different in your working directory, but your
repository essentially "knows more".

	-Daniel
*This .sig left intentionally blank*

* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Bill Davidsen @ 2005-05-03 16:22 UTC
  To: Matt Mackall
  Cc: Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>,
	Linus Torvalds, Ryan Anderson, Andrea Arcangeli, linux-kernel,
	git
In-Reply-To: <20050503012921.GD22038@waste.org>

Matt Mackall wrote:
> On Tue, May 03, 2005 at 03:16:26AM +0200, Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org> wrote:
> 
>>Linus Torvalds <torvalds@osdl.org> wrote:
>>
>>>On Mon, 2 May 2005, Ryan Anderson wrote:
>>>
>>>>On Mon, May 02, 2005 at 09:31:06AM -0700, Linus Torvalds wrote:
>>
>>>>>That said, I think the /usr/bin/env trick is stupid too. It may be more
>>>>>portable for various Linux distributions, but if you want _true_
>>>>>portability, you use /bin/sh, and you do something like
>>>>>
>>>>>#!/bin/sh
>>>>>exec perl perlscript.pl "$@"
>>>>
>>>>if 0;
>>
>>exec may fail.
>>
>>#!/bin/sh
>>exec perl -x $0 ${1+"$@"} || exit 127
>>#!perl
>>
>>
>>>>You don't really want Perl to get itself into an exec loop.
>>>
>>>This would _not_ be "perlscript.pl" itself. This is the shell-script, and
>>>it's not called ".pl".
>>
>>In this thread, it originally was.
> 
> 
> In this thread, it was originally a Python script. In particular, one
> aimed at managing the Linux kernel source. I'm going to use
> /usr/bin/env, systems where that doesn't exist can edit the source.

On the theory that my first post got lost, why use /usr/bin/env at all, 
when bash already does that substitution? To support people who use 
other shells?

ie.:
    FOO=xx perl -e '$a=$ENV{FOO}; print "$a\n"'
-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 16:09 UTC
  To: Linus Torvalds; +Cc: Nicolas Pitre, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505030804170.3594@ppc970.osdl.org>

On Tuesday 03 May 2005 11:07, Linus Torvalds wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > On the full import of all the bk->cvs changesets, the average file size
> > in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.
>
> Have you checked how many of those are blobs?
>
I've got cg-admin-lsobj running (effectively find .git -type f | xargs 
cat-file), it is taking a looong time but the ratios seem to stay pretty 
constant as it makes progress:

total: 186863
blob: 93688     (6.6 per commit)
commit: 14172
tree: 79003      (5.5 per commit)

> For many commits, we generate as many (or more) _tree_ objects as we
> generate blobs.
>
> And tree objects from the same "supertree" really are something that I
> wouldn't mind packing some way, because they really tend to be very much
> related (since they refer to each other). Eg the commit and the top-level
> tree are almost always a pair, since you'd get a shared top-level tree
> only with two commits that have the exact same content (which definitely
> happens, don't get me wrong, but if we get some duplication for that case,
> we'd still be winning).
>

The packed item patch wouldn't duplicate info in this case.  When it initially 
creates the packed buffer (before compression), it checks for an existing 
file with the same sha1 and returns if one is found.  This is to preserve the 
optimizations for write_tree case where it frequently tries to create files 
that already exist.

-chris

* Re: questions about cg-update, cg-pull, and cg-clone.
From: Joel Becker @ 2005-05-03 15:59 UTC
  To: Zack Brown; +Cc: Petr Baudis, Git Mailing List
In-Reply-To: <20050503152214.GA1704@tumblerings.org>

On Tue, May 03, 2005 at 08:22:15AM -0700, Zack Brown wrote:
> So, suppose I'm working on your Cogito HEAD. I make some changes to my local
> tree and commit them to my tree, and then before I go forward, I want to grab
> whatever you've done recently, to make sure we're not in conflict before I add
> new changes. If I understand you right, this situation would be a 'fast forward
> merge'. So what is the command I give to just 'merge' your HEAD with mine,
> without requiring a changelog entry?

	Remember that HEAD is merely a SHA1 of the toplevel tree object.
Imagine you have the simplest tree, one directory containing one file.
The file has the hash aaaaaa.  The tree object containing it has the
hash bbbbbb.  So, HEAD contains bbbbbb.
	Now you update from Petr, having made no changes.  You pull his
newest tree, which also has a new file.  That new file has the hash
cccccc.  The new tree object, containing both files, now has the hash
dddddd.  HEAD now contains dddddd.  As you are in a matching state to
his tree, you have not done anything interesting to your tree, and there
is no commit.  This is a "fast-forward" merge.
	Then you change the first file, adding a few functions.  You
commit it, and it now has the hash 111111.  This change means the tree
hash becomes 222222.  So, HEAD contains 222222.
	You then update from Petr again.  He's changed the second file.
Its hash is no longer cccccc, it's eeeeee.  In his tree, the hash of
the tree is 333333 (from file 1's aaaaaa and file 2's eeeeee).  But the
hash of your tree is 444444 (from your local file 1's 111111 and file
2's eeeeee).  So, the hash of your tree becomes 444444.  Your HEAD
contains 444444.
This does _not_ match his 333333 HEAD.  You are committing the
combination of his change and yours.  He is saying that this work, which
may have required hand-merging or commit resolution, is "interesting"
information.
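
The way a tree hash follows from the file hashes can be illustrated with a
short Python sketch. It is a simplification under stated assumptions, not
git's actual object format: `blob_id` and `tree_id` are hypothetical helpers,
and the tree is a flat listing of "name hash" lines with no object headers,
file modes or subdirectories.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash of one file's contents."""
    return hashlib.sha1(content).hexdigest()


def tree_id(files: dict) -> str:
    """Hash of a flat directory: one 'name hash' line per file, sorted by name."""
    listing = "".join(
        f"{name} {blob_id(data)}\n" for name, data in sorted(files.items())
    )
    return hashlib.sha1(listing.encode()).hexdigest()


# Petr's tree after he changed the second file.
petr = {"file1": b"original\n", "file2": b"changed by Petr\n"}

# Your tree: his new second file plus your own change to the first file.
mine = {"file1": b"original\nplus a few functions\n", "file2": b"changed by Petr\n"}

print(tree_id(petr) == tree_id(mine))        # False: the trees differ
print(tree_id(petr) == tree_id(dict(petr)))  # True: same content, same hash
```

Any change to a file changes its hash and therefore the hash of the tree that
lists it, so two heads match only when the whole contents match.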

Joel

-- 

Life's Little Instruction Book #69

	"Whistle"

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [PATCH] add the ability to create and retrieve delta objects
From: C. Scott Ananian @ 2005-05-03 15:57 UTC
  To: Chris Mason; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505030724.57827.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> your delta generator later this week.  Some quick and dirty space numbers to
> show why we need to pack the files together:

Are you accurately accounting for the cost of the extra hard/soft links 
your scheme requires?  Ie the directories get larger, lookups take 
slightly longer, etc.  Also access to a given file takes longer, and the 
deltas are referring to *other* packed files which *also* take longer to 
decompress and access...

How much better does delta-fication do, compared to just packing?
  --scott

NSA FJDEFLECT radar WASHTUB justice LCFLUTTER KUCLUB PBHISTORY Ft. Bragg 
ammunition immediate ESMERALDITE DC terrorist C4 SLBM affinity group
                          ( http://cscott.net/ )

* Re: RFC: adding xdelta compression to git
From: C. Scott Ananian @ 2005-05-03 15:52 UTC
  To: Davide Libenzi; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505022215110.21733@bigblue.dev.mdolabs.com>

On Mon, 2 May 2005, Davide Libenzi wrote:

> On Mon, 2 May 2005, Linus Torvalds wrote:
>
>> Yes. EXCEPT for one thing. fsck. I'd _really_ like fsck to be able to know
>> something about any xdelta objects, if only because if/when things go

> Linus, xdelta-based algorithms already store information regarding the
> object that originated the diff. Since they have no context (as
> text-based diffs do) and are simply based on offset-driven copy/insert
> operations, this is a requirement. Libxdiff uses an adler32+size of the
> original object, but you can get as fancy as you like in your own
> implementation. Before delta patching, the stored information is cross
> checked with the input base object, and the delta patch will fail in the
> event of a mismatch. So an fsck is simply a walk backward (or forward,
> depending on your metadata model) of the whole delta chain.

Linus knows this.  His point is just to be sure you actually *code* that 
walk in fsck, and (hopefully) do so w/o complicating the fsck too much.
  --scott

supercomputer BOND quiche SYNCARP Honduras North Korea Qaddafi PANCHO 
SKILLET KUDESK non-violent protest ESQUIRE struggle Saddam Hussein
                          ( http://cscott.net/ )

* [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 15:52 UTC
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.58.0505030742330.3594@ppc970.osdl.org>

On Tue, 3 May 2005, Linus Torvalds wrote:

> On Tue, 3 May 2005, Nicolas Pitre wrote:
> > 
> > Yep, that's what I've done last weekend (and just made it actually 
> > work since people are getting interested).
> 
> I have to say that it looks uncommonly simple. Also, afaik, this should
> still work with the current fsck, it's just that because fsck doesn't
> understand the linkages, the error reporting won't be as good as it could
> be (I'd _much_ rather see "delta failed in object xxxxx" than "unable to
> read xxxxxx").

Yep.  Let's do it in a separate patch if you please.

> Now, one thing I like about this approach is that the actual delta 
> _generation_ can be done off-line, and independently of anything else. 
> Which means that the performance paths I care about (commit etc) are 
> largely unaffected, and you can "deltify" a git archive overnight or 
> something. 

Yes.  And actually you can use any kind of delta reference topology as 
you wish.  It may start from the first object revision and the next 
revision is a delta against the first, the third a delta against the 
second, etc.  But it is much more interesting to do it the other way 
around, such that the second revision is stored as is and the first 
revision is made a delta against the second revision.  Then on the next 
commit the third revision is stored as is and the second rev made a 
delta against the third, and so on.  You therefore get delta compression 
at commit time with little overhead if you wish to do that.  And this 
approach has the advantage of keeping the latest object revisions quickly
accessible, while the delta overhead is relegated to the old historic
objects.

And suppose the delta chain is too deep for some objects and accessing
them incurs too much overhead.  No problem: just pick a random object in
the middle of the delta chain and swap it with its original undeltafied 
version and the delta chain is now cut in two.

Etc.  It's flexible and open to any arrangement.
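
As a rough illustration of that flexibility, here is a Python sketch of a
reverse delta chain being cut in two. It makes several assumptions: the
object store is a plain dict, the delta is a toy "shared prefix plus new
suffix" pair rather than the format used by the patch, and all function
names are hypothetical.

```python
def make_delta(base: bytes, target: bytes):
    """Toy delta: length of the shared prefix, plus the rest of the target."""
    keep = 0
    while keep < min(len(base), len(target)) and base[keep] == target[keep]:
        keep += 1
    return keep, target[keep:]


def apply_delta(base: bytes, delta) -> bytes:
    keep, suffix = delta
    return base[:keep] + suffix


def build_chain(revisions):
    """Newest revision stored as is, each older one as a delta against its successor."""
    objects = {}
    last = len(revisions) - 1
    for i, data in enumerate(revisions):
        if i == last:
            objects[i] = ("full", data)
        else:
            objects[i] = ("delta", i + 1, make_delta(revisions[i + 1], data))
    return objects


def read_object(objects, oid) -> bytes:
    """Follow reference objects until a full one, then apply the deltas back."""
    pending = []
    entry = objects[oid]
    while entry[0] == "delta":
        _, base_id, delta = entry
        pending.append(delta)
        entry = objects[base_id]
    data = entry[1]
    for delta in reversed(pending):
        data = apply_delta(data, delta)
    return data


def depth(objects, oid) -> int:
    """Number of deltas that must be applied to read `oid`."""
    count = 0
    while objects[oid][0] == "delta":
        oid = objects[oid][1]
        count += 1
    return count


def undeltify(objects, oid) -> None:
    """Swap a delta object for its full version, cutting the chain at that point."""
    objects[oid] = ("full", read_object(objects, oid))
```

With nine revisions, the oldest one starts at depth 8; after
`undeltify(objects, 4)` it is at depth 4, and every revision still reads back
unchanged.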

OK, here's a revised patch correcting the little bug found by
Chris Mason.

==========

This patch adds the necessary functionality to perform delta
compression on objects.  It adds a git-mkdelta command which can replace
any object with its deltafied version given a reference object.

Access to a delta object will transparently fetch the reference object
and apply the transformation.  Scripts can be used to perform any sort
of compression policy on top of it.

The delta generator has been extracted from libxdiff and optimized for
git usage in order to avoid as much data copy as possible, and the delta
storage format modified to be even more compact.  Therefore no need to
rely on any external library.  The test-delta program can be used to
test it.

Many refinements are needed, but it is better to merge them separately.
Loop detection and a recursion threshold are a few examples.

Signed-off-by: Nicolas Pitre <nico@cam.org>

--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,7 @@ install: $(PROG) $(SCRIPTS)
 	install $(PROG) $(SCRIPTS) $(HOME)/bin/
 
 LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
-	 tag.o date.o
+	 tag.o date.o diff-delta.o patch-delta.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h blob.h tree.h commit.h tag.h
 
@@ -63,6 +63,9 @@ $(LIB_FILE): $(LIB_OBJS)
 test-date: test-date.c date.o
 	$(CC) $(CFLAGS) -o $@ test-date.c date.o
 
+test-delta: test-delta.c diff-delta.o patch-delta.o
+	$(CC) $(CFLAGS) -o $@ $^
+
 git-%: %.c $(LIB_FILE)
 	$(CC) $(CFLAGS) -o $@ $(filter %.c,$^) $(LIBS)
 
@@ -92,6 +95,7 @@ git-rpush: rsh.c
 git-rpull: rsh.c pull.c
 git-rev-list: rev-list.c
 git-mktag: mktag.c
+git-mkdelta: mkdelta.c
 git-diff-tree-helper: diff-tree-helper.c
 git-tar-tree: tar-tree.c
 git-write-blob: write-blob.c
Created: delta.h (mode:100644)
--- /dev/null
+++ b/delta.h
@@ -0,0 +1,6 @@
+extern void *diff_delta(void *from_buf, unsigned long from_size,
+			void *to_buf, unsigned long to_size,
+		        unsigned long *delta_size);
+extern void *patch_delta(void *src_buf, unsigned long src_size,
+			 void *delta_buf, unsigned long delta_size,
+			 unsigned long *dst_size);
Created: diff-delta.c (mode:100644)
--- /dev/null
+++ b/diff-delta.c
@@ -0,0 +1,315 @@
+/*
+ * diff-delta.c: generate a delta between two buffers
+ *
+ *  Many parts of this file have been lifted from LibXDiff version 0.10.
+ *  http://www.xmailserver.org/xdiff-lib.html
+ *
+ *  LibXDiff was written by Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2003	Davide Libenzi
+ *
+ *  Many mods for GIT usage by Nicolas Pitre <nico@cam.org>, (C) 2005.
+ *
+ *  This file is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ */
+
+#include <stdlib.h>
+#include "delta.h"
+
+
+/* block size: min = 16, max = 64k, power of 2 */
+#define BLK_SIZE 16
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define GR_PRIME 0x9e370001
+#define HASH(v, b) (((unsigned int)(v) * GR_PRIME) >> (32 - (b)))
+	
+/* largest prime smaller than 65536 */
+#define BASE 65521
+
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+#define NMAX 5552
+
+#define DO1(buf, i)  { s1 += buf[i]; s2 += s1; }
+#define DO2(buf, i)  DO1(buf, i); DO1(buf, i + 1);
+#define DO4(buf, i)  DO2(buf, i); DO2(buf, i + 2);
+#define DO8(buf, i)  DO4(buf, i); DO4(buf, i + 4);
+#define DO16(buf)    DO8(buf, 0); DO8(buf, 8);
+
+static unsigned int adler32(unsigned int adler, const unsigned char *buf, int len)
+{
+	int k;
+	unsigned int s1 = adler & 0xffff;
+	unsigned int s2 = adler >> 16;
+
+	while (len > 0) {
+		k = MIN(len, NMAX);
+		len -= k;
+		while (k >= 16) {
+			DO16(buf);
+			buf += 16;
+			k -= 16;
+		}
+		if (k != 0)
+			do {
+				s1 += *buf++;
+				s2 += s1;
+			} while (--k);
+		s1 %= BASE;
+		s2 %= BASE;
+	}
+
+	return (s2 << 16) | s1;
+}
+
+static unsigned int hashbits(unsigned int size)
+{
+	unsigned int val = 1, bits = 0;
+	while (val < size && bits < 32) {
+		val <<= 1;
+	       	bits++;
+	}
+	return bits ? bits: 1;
+}
+
+typedef struct s_chanode {
+	struct s_chanode *next;
+	int icurr;
+} chanode_t;
+
+typedef struct s_chastore {
+	chanode_t *head, *tail;
+	int isize, nsize;
+	chanode_t *ancur;
+	chanode_t *sncur;
+	int scurr;
+} chastore_t;
+
+static void cha_init(chastore_t *cha, int isize, int icount)
+{
+	cha->head = cha->tail = NULL;
+	cha->isize = isize;
+	cha->nsize = icount * isize;
+	cha->ancur = cha->sncur = NULL;
+	cha->scurr = 0;
+}
+
+static void *cha_alloc(chastore_t *cha)
+{
+	chanode_t *ancur;
+	void *data;
+
+	ancur = cha->ancur;
+	if (!ancur || ancur->icurr == cha->nsize) {
+		ancur = malloc(sizeof(chanode_t) + cha->nsize);
+		if (!ancur)
+			return NULL;
+		ancur->icurr = 0;
+		ancur->next = NULL;
+		if (cha->tail)
+			cha->tail->next = ancur;
+		if (!cha->head)
+			cha->head = ancur;
+		cha->tail = ancur;
+		cha->ancur = ancur;
+	}
+
+	data = (void *)ancur + sizeof(chanode_t) + ancur->icurr;
+	ancur->icurr += cha->isize;
+	return data;
+}
+
+static void cha_free(chastore_t *cha)
+{
+	chanode_t *cur = cha->head;
+	while (cur) {
+		chanode_t *tmp = cur;
+		cur = cur->next;
+		free(tmp);
+	}
+}
+
+typedef struct s_bdrecord {
+	struct s_bdrecord *next;
+	unsigned int fp;
+	const unsigned char *ptr;
+} bdrecord_t;
+
+typedef struct s_bdfile {
+	const unsigned char *data, *top;
+	chastore_t cha;
+	unsigned int fphbits;
+	bdrecord_t **fphash;
+} bdfile_t;
+
+static int delta_prepare(const unsigned char *buf, int bufsize, bdfile_t *bdf)
+{
+	unsigned int fphbits;
+	int i, hsize;
+	const unsigned char *base, *data, *top;
+	bdrecord_t *brec;
+	bdrecord_t **fphash;
+
+	fphbits = hashbits(bufsize / BLK_SIZE + 1);
+	hsize = 1 << fphbits;
+	fphash = malloc(hsize * sizeof(bdrecord_t *));
+	if (!fphash)
+		return -1;
+	for (i = 0; i < hsize; i++)
+		fphash[i] = NULL;
+	cha_init(&bdf->cha, sizeof(bdrecord_t), hsize / 4 + 1);
+
+	bdf->data = data = base = buf;
+	bdf->top = top = buf + bufsize;
+	data += (bufsize / BLK_SIZE) * BLK_SIZE;
+	if (data == top)
+		data -= BLK_SIZE;
+
+	for ( ; data >= base; data -= BLK_SIZE) {
+		brec = cha_alloc(&bdf->cha);
+		if (!brec) {
+			cha_free(&bdf->cha);
+			free(fphash);
+			return -1;
+		}
+		brec->fp = adler32(0, data, MIN(BLK_SIZE, top - data));
+		brec->ptr = data;
+		i = HASH(brec->fp, fphbits);
+		brec->next = fphash[i];
+		fphash[i] = brec;
+	}
+
+	bdf->fphbits = fphbits;
+	bdf->fphash = fphash;
+
+	return 0;
+}
+
+static void delta_cleanup(bdfile_t *bdf)
+{
+	free(bdf->fphash);
+	cha_free(&bdf->cha);
+}
+
+#define COPYOP_SIZE(o, s) \
+    (!!(o & 0xff) + !!(o & 0xff00) + !!(o & 0xff0000) + !!(o & 0xff000000) + \
+     !!(s & 0xff) + !!(s & 0xff00) + 1)
+
+void *diff_delta(void *from_buf, unsigned long from_size,
+		 void *to_buf, unsigned long to_size,
+		 unsigned long *delta_size)
+{
+	int i, outpos, outsize, inscnt, csize, msize, moff;
+	unsigned int fp;
+	const unsigned char *data, *top, *ptr1, *ptr2;
+	unsigned char *out, *orig;
+	bdrecord_t *brec;
+	bdfile_t bdf;
+
+	if (delta_prepare(from_buf, from_size, &bdf))
+		return NULL;
+	
+	outpos = 0;
+	outsize = 4096;
+	out = malloc(outsize);
+	if (!out) {
+		delta_cleanup(&bdf);
+		return NULL;
+	}
+
+	data = to_buf;
+	top = to_buf + to_size;
+
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size;
+
+	inscnt = 0;
+	moff = 0;
+	while (data < top) {
+		msize = 0;
+		fp = adler32(0, data, MIN(top - data, BLK_SIZE));
+		i = HASH(fp, bdf.fphbits);
+		for (brec = bdf.fphash[i]; brec; brec = brec->next) {
+			if (brec->fp == fp) {
+				csize = bdf.top - brec->ptr;
+				if (csize > top - data)
+					csize = top - data;
+				for (ptr1 = brec->ptr, ptr2 = data; 
+				     csize && *ptr1 == *ptr2;
+				     csize--, ptr1++, ptr2++);
+
+				csize = ptr1 - brec->ptr;
+				if (csize > msize) {
+					moff = brec->ptr - bdf.data;
+					msize = csize;
+					if (msize >= 0x10000) {
+						msize = 0x10000;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!msize || msize < COPYOP_SIZE(moff, msize)) {
+			if (!inscnt)
+				outpos++;
+			out[outpos++] = *data++;
+			inscnt++;
+			if (inscnt == 0x7f) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+		} else {
+			if (inscnt) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+
+			data += msize;
+			orig = out + outpos++;
+			i = 0x80;
+
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
+			msize >>= 8;
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
+
+			*orig = i;
+		}
+
+		/* next time around the largest possible output is 1 + 4 + 3 */
+		if (outpos > outsize - 8) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			out = realloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				delta_cleanup(&bdf);
+				return NULL;
+			}
+		}
+	}
+
+	if (inscnt)
+		out[outpos - inscnt - 1] = inscnt;
+
+	delta_cleanup(&bdf);
+	*delta_size = outpos;
+	return out;
+}
Created: patch-delta.c (mode:100644)
--- /dev/null
+++ b/patch-delta.c
@@ -0,0 +1,73 @@
+/*
+ * patch-delta.c:
+ * recreate a buffer from a source and the delta produced by diff-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+#include "delta.h"
+
+void *patch_delta(void *src_buf, unsigned long src_size,
+		  void *delta_buf, unsigned long delta_size,
+		  unsigned long *dst_size)
+{
+	const unsigned char *data, *top;
+	unsigned char *dst, *out;
+	int size;
+
+	/* the smallest delta size possible is 10 bytes */
+	if (delta_size < 10)
+		return NULL;
+
+	data = delta_buf;
+	top = delta_buf + delta_size;
+
+	/* make sure the orig file size matches what we expect */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	if (size != src_size)
+		return NULL;
+
+	/* now the result size */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	dst = malloc(size);
+	if (!dst)
+		return NULL;
+
+	out = dst;
+	while (data < top) {
+		unsigned char cmd = *data++;
+		if (cmd & 0x80) {
+			unsigned int cp_off = 0, cp_size = 0;
+			if (cmd & 0x01) cp_off = *data++;
+			if (cmd & 0x02) cp_off |= (*data++ << 8);
+			if (cmd & 0x04) cp_off |= (*data++ << 16);
+			if (cmd & 0x08) cp_off |= (*data++ << 24);
+			if (cmd & 0x10) cp_size = *data++;
+			if (cmd & 0x20) cp_size |= (*data++ << 8);
+			if (cp_size == 0) cp_size = 0x10000;
+			memcpy(out, src_buf + cp_off, cp_size);
+			out += cp_size;
+		} else {
+			memcpy(out, data, cmd);
+			out += cmd;
+			data += cmd;
+		}
+	}
+
+	/* sanity check */
+	if (data != top || out - dst != size) {
+		free(dst);
+		return NULL;
+	}
+
+	*dst_size = size;
+	return dst;
+}
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -8,6 +8,7 @@
  */
 #include <stdarg.h>
 #include "cache.h"
+#include "delta.h"
 
 const char *sha1_file_directory = NULL;
 
@@ -186,7 +187,8 @@ void * unpack_sha1_file(void *map, unsig
 	int ret, bytes;
 	z_stream stream;
 	char buffer[8192];
-	char *buf;
+	char *buf, *delta_ref;
+	unsigned long delta_ref_sz;
 
 	/* Get the data stream */
 	memset(&stream, 0, sizeof(stream));
@@ -201,8 +203,15 @@ void * unpack_sha1_file(void *map, unsig
 		return NULL;
 	if (sscanf(buffer, "%10s %lu", type, size) != 2)
 		return NULL;
-
 	bytes = strlen(buffer) + 1;
+
+	if (!strcmp(type, "delta")) {
+		delta_ref = read_sha1_file(buffer + bytes, type, &delta_ref_sz);
+		if (!delta_ref)
+			return NULL;
+	} else
+		delta_ref = NULL;
+
 	buf = xmalloc(*size);
 
 	memcpy(buf, buffer + bytes, stream.total_out - bytes);
@@ -214,6 +223,17 @@ void * unpack_sha1_file(void *map, unsig
 			/* nothing */;
 	}
 	inflateEnd(&stream);
+
+	if (delta_ref) {
+		char *newbuf;
+		unsigned long newsize;
+		newbuf = patch_delta(delta_ref, delta_ref_sz, buf+20, *size-20, &newsize);
+		free(delta_ref);
+		free(buf);
+		buf = newbuf;
+		*size = newsize;
+	}
+
 	return buf;
 }
 
Created: test-delta.c (mode:100644)
--- /dev/null
+++ b/test-delta.c
@@ -0,0 +1,79 @@
+/*
+ * test-delta.c: test code to exercise diff-delta.c and patch-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include "delta.h"
+
+static const char *usage =
+	"test-delta (-d|-p) <from_file> <data_file> <out_file>";
+
+int main(int argc, char *argv[])
+{
+	int fd;
+	struct stat st;
+	void *from_buf, *data_buf, *out_buf;
+	unsigned long from_size, data_size, out_size;
+
+	if (argc != 5 || (strcmp(argv[1], "-d") && strcmp(argv[1], "-p"))) {
+		fprintf(stderr, "Usage: %s\n", usage);
+		return 1;
+	}
+
+	fd = open(argv[2], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[2]);
+		return 1;
+	}
+	from_size = st.st_size;
+	from_buf = mmap(NULL, from_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (from_buf == MAP_FAILED) {
+		perror(argv[2]);
+		return 1;
+	}
+	close(fd);
+
+	fd = open(argv[3], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[3]);
+		return 1;
+	}
+	data_size = st.st_size;
+	data_buf = mmap(NULL, data_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (data_buf == MAP_FAILED) {
+		perror(argv[3]);
+		return 1;
+	}
+	close(fd);
+
+	if (argv[1][1] == 'd')
+		out_buf = diff_delta(from_buf, from_size,
+				     data_buf, data_size, &out_size);
+	else
+		out_buf = patch_delta(from_buf, from_size,
+				      data_buf, data_size, &out_size);
+	if (!out_buf) {
+		fprintf(stderr, "delta operation failed (returned NULL)\n");
+		return 1;
+	}
+
+	fd = open (argv[4], O_WRONLY|O_CREAT|O_TRUNC, 0666);
+	if (fd < 0 || write(fd, out_buf, out_size) != out_size) {
+		perror(argv[4]);
+		return 1;
+	}
+
+	return 0;
+}
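
[Editor's note] The copy command that diff-delta.c emits and patch-delta.c consumes can be exercised on its own. The sketch below is not part of the patch; encode_copy and decode_copy are hypothetical helper names. It assumes the layout visible in the code above: a command byte with bit 7 set, bits 0-3 flagging which offset bytes follow, bits 4-5 flagging which size bytes follow, and a size of 0 standing for 0x10000.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one copy command. Only the non-zero bytes of the offset and
 * size are written; the command byte records which ones are present.
 * Returns the number of bytes written to out (at most 7).
 */
static size_t encode_copy(unsigned char *out, unsigned int off, unsigned int size)
{
	unsigned char *p = out + 1;	/* out[0] is the command byte */
	int i;

	assert(size >= 1 && size <= 0x10000);
	out[0] = 0x80;
	for (i = 0; i < 4; i++, off >>= 8)
		if (off & 0xff) {
			*p++ = off & 0xff;
			out[0] |= 1 << i;
		}
	/* 0x10000 has no non-zero byte in its low 16 bits, so nothing is written */
	for (i = 0; i < 2; i++, size >>= 8)
		if (size & 0xff) {
			*p++ = size & 0xff;
			out[0] |= 0x10 << i;
		}
	return (size_t)(p - out);
}

/*
 * Decode one copy command starting at in[0].
 * Returns the number of bytes consumed.
 */
static size_t decode_copy(const unsigned char *in, unsigned int *off, unsigned int *size)
{
	const unsigned char *p = in;
	unsigned char cmd = *p++;
	int i;

	assert(cmd & 0x80);
	*off = 0;
	*size = 0;
	for (i = 0; i < 4; i++)
		if (cmd & (1 << i))
			*off |= (unsigned int)*p++ << (8 * i);
	for (i = 0; i < 2; i++)
		if (cmd & (0x10 << i))
			*size |= (unsigned int)*p++ << (8 * i);
	if (*size == 0)
		*size = 0x10000;	/* an absent size means the 64k maximum */
	return (size_t)(p - in);
}
```

Because zero bytes are skipped, a copy of 16 bytes from offset 0x1200 takes three bytes (0x92, 0x12, 0x10), and a 64k copy from offset 0 takes only the command byte.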


* Re: RFC: adding xdelta compression to git
From: C. Scott Ananian @ 2005-05-03 15:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alon Ziv, git
In-Reply-To: <Pine.LNX.4.58.0505022131380.3594@ppc970.osdl.org>

On Mon, 2 May 2005, Linus Torvalds wrote:

>> * Changes the repository format.
>
> It wouldn't necessarily. You should be able to do this with _zero_ changes
> to existing objects what-so-ever.

Yes.  The 'chunking' code I posted earlier does this, etc.  It's kinda odd 
computing a SHA-1 including the 'blob <size>\0' header, even when your 
representation doesn't use this type exactly, but it's no big deal.  I'm 
still tinkering with this, btw; I can get modest improvements in 'real' disk 
space used, but nothing earth-shattering (yet).  I'll post the list of 
things I tried and how well they worked at some point, just to save people 
the effort of retrying things.

I've been working from the 'no knowledge of commit structure needed' 
perspective; I think Chris Mason has been using the structure of the 
commit object to guide delta-fication and showing more impressive 
space savings.
  --scott

HTAUTOMAT Legion of Doom payment PBPRIME insurgent shortwave AVBUSY 
Nader PBCABOOSE overthrow explosion Ortega STANDEL ECJOB Sigint FBI
                          ( http://cscott.net/ )


* Re: questions about cg-update, cg-pull, and cg-clone.
From: Zack Brown @ 2005-05-03 15:22 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Git Mailing List
In-Reply-To: <20050502195846.GC20818@pasky.ji.cz>

On Mon, May 02, 2005 at 09:58:46PM +0200, Petr Baudis wrote:
> Dear diary, on Sat, Apr 30, 2005 at 02:53:22AM CEST, I got a letter
> where Zack Brown <zbrown@tumblerings.org> told me that...
> > 'cg-update branch-name' grabs any new changes from the upstream repository and
> > merges them into my local repository. If I've been editing files in my local
> > repository, the update attempts to merge the changes cleanly.
> 
> Yes.
> 
> > Now, if the update is clean, a cg-commit is invoked automatically, and if the
> > update is not clean, I then have to resolve any conflicts and give the cg-commit
> > command by hand. But: what is the significance of either of these cg-commit
> > commands? Why should I have to write a changelog entry recording this merge? All
> 
> You might want to write some special notes regarding the merge, e.g.
> when you want to describe some non-trivial conflict resolution, or even
> give a short blurb of the changes you are merging.
> 
> If you don't know what to say, just press Ctrl-D. The first line of the
> commit always says "Merge with what_you_are_merging_with".
> 
> > I'm doing is updating my tree to be current. Why should I have to 'commit' that
> > update?
> 
> If you are only updating your tree to be current, you don't have to
> commit, and in fact you don't commit (you do so-called "fast-forward
> merge", which will just update your HEAD pointer to point at the newer
> commit). You commit only when you were merging stuff (so-called "tree
> merge"; well, that's at least what I call it to differentiate it from the
> fast-forward merge). That means you have some local commits over there -
> I can't just update your tree to be current, sorry. That would lose your
> commit. I have to merge the changes into your tree through a merge
> commit.
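
[Editor's note] The distinction quoted above can be sketched as an ancestry test. This is only an illustration, not Cogito's code: struct commit, is_ancestor and classify are hypothetical names, and the graph is reduced to at most two parents per commit.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal commit node: up to two parents, NULL when absent. */
struct commit {
	const struct commit *parent[2];
};

/* Return 1 if a is c itself or reachable from c through parent links. */
static int is_ancestor(const struct commit *a, const struct commit *c)
{
	if (!c)
		return 0;
	if (c == a)
		return 1;
	return is_ancestor(a, c->parent[0]) || is_ancestor(a, c->parent[1]);
}

enum merge_kind { UP_TO_DATE, FAST_FORWARD, TREE_MERGE };

/* Decide what merging remote into local requires. */
static enum merge_kind classify(const struct commit *local,
				const struct commit *remote)
{
	assert(local && remote);
	if (is_ancestor(remote, local))
		return UP_TO_DATE;	/* local already contains remote */
	if (is_ancestor(local, remote))
		return FAST_FORWARD;	/* just move HEAD forward, no commit */
	return TREE_MERGE;		/* histories diverged: merge commit needed */
}
```

Under this reading, a tree with local commits of its own, merged with an upstream that has also moved on, is classified as a tree merge rather than a fast-forward.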

Hm.

So, suppose I'm working on your Cogito HEAD. I make some changes to my local
tree and commit them to my tree, and then before I go forward, I want to grab
whatever you've done recently, to make sure we're not in conflict before I add
new changes. If I understand you right, this situation would be a 'fast forward
merge'. So what is the command I give to just 'merge' your HEAD with mine,
without requiring a changelog entry?

Alternatively, suppose I'm you, the project lead, and Zackdude has some
changes for me, based on my HEAD. I want to 'merge' his tree into mine. If
I'm still understanding you, this is a 'tree merge'. Now I give a cg-update,
and now I *want* to give a changelog entry to record the merge.  Correct?

No, I still don't see it. I don't see why I would want to add an additional
changelog entry on top of whatever changelog entries Zackdude has made himself.
It just seems to pollute the changelog with entries that are essentially
meaningless. When I read back over the logs, I'm not going to be interested in
the bookkeeping of when I merged with various developers, I'm going to be
interested in what those developers actually did to the code, and what *I*
actually did to the code.

> 
> > Now I look at 'cg-pull'. What does this do? The readme says something about
> > printing two ids, and being useful for diffs. But can't I do a diff after a
> > cg-update and get the same result? I'm very confused about cg-pull right now.
> 
> cg-pull does the first part of cg-update. It is concerned with fetching
> the stuff from the remote repository to the local one. cg-merge then
> does the second part, merging the stuff to your local tree (doing either
> fast-forward or tree merge).

OK, I don't understand this either. What is the difference between fetching the
stuff and merging the stuff? Suppose I am working on a local repo of Cogito
HEAD. I make some changes, commit them, and then I do a cg-pull. What happens?
Are my changes overwritten? Do they show up at all? Do they exist in some
nebulous ether that I will never see until I do a merge?

Be well,
Zack

> 
> -- 
> 				Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Zack Brown


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Linus Torvalds @ 2005-05-03 15:07 UTC (permalink / raw)
  To: Chris Mason; +Cc: Nicolas Pitre, Alon Ziv, git
In-Reply-To: <200505030724.57827.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:
> 
> On the full import of all the bk->cvs changesets, the average file size 
> in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.

Have you checked how many of those are blobs?

For many commits, we generate as many (or more) _tree_ objects as we 
generate blobs. 

And tree objects from the same "supertree" really are something that I
wouldn't mind packing some way, because they really tend to be very much
related (since they refer to each other). Eg the commit and the top-level
tree are almost always a pair, since you'd get a shared top-level tree
only with two commits that have the exact same content (which definitely
happens, don't get me wrong, but if we get some duplication for that case,
we'd still be winning).

		Linus


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 15:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505031037.38005.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> On Tuesday 03 May 2005 10:24, Nicolas Pitre wrote:
> > On Tue, 3 May 2005, Chris Mason wrote:
> > > Hmmm, something is strange here, am I using this wrong?
> > >
> > > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> > > Aborted
> >
> > Can you send me your foo and delta2 files?
> >
> Sorry, thought I had the whole command history in there.  I went for something 
> small to start ;)
> 
> coffee:~/git/linus.orig # echo foo > foo
> coffee:~/git/linus.orig # echo foo2 > foo2
> coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> coffee:~/git/linus.orig # ls -la delta1
> -rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
> coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***

OK, doh!

--- diff-delta.c.orig	2005-05-03 11:00:39.900529634 -0400
+++ diff-delta.c	2005-05-03 11:01:03.210031176 -0400
@@ -307,7 +307,7 @@
 	}
 
 	if (inscnt)
-		out[-inscnt - 1] = inscnt;
+		out[outpos - inscnt - 1] = inscnt;
 
 	delta_cleanup(&bdf);
 	*delta_size = outpos;
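
[Editor's note] The one-line fix above concerns how literal runs are written: a count byte is reserved before the first literal and filled in afterwards, so the index has to be relative to outpos. The sketch below isolates that path. literal_delta is a hypothetical helper, not a function from the patch, and it assumes that no block of the source matched, so the whole target is emitted as literals.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write the 8-byte header (source size, then target size, each as a
 * little-endian 32-bit value) followed by the target as literal runs
 * of at most 0x7f bytes, each preceded by its count byte.
 * The caller must provide at least 8 + to_size + to_size / 0x7f + 1 bytes.
 * Returns the length of the delta.
 */
static size_t literal_delta(unsigned char *out, unsigned long from_size,
			    const unsigned char *to, unsigned long to_size)
{
	size_t outpos = 0;
	unsigned long i, n;
	unsigned int inscnt = 0;

	assert(out && to);
	for (i = 0, n = from_size; i < 4; i++, n >>= 8)
		out[outpos++] = n & 0xff;
	for (i = 0, n = to_size; i < 4; i++, n >>= 8)
		out[outpos++] = n & 0xff;

	for (i = 0; i < to_size; i++) {
		if (!inscnt)
			outpos++;	/* reserve the count byte for this run */
		out[outpos++] = to[i];
		if (++inscnt == 0x7f) {
			out[outpos - inscnt - 1] = inscnt;
			inscnt = 0;
		}
	}
	if (inscnt)	/* back-patch the last run, relative to outpos */
		out[outpos - inscnt - 1] = inscnt;
	return outpos;
}
```

For a 4-byte source and the 5-byte target "foo2\n" this gives 8 + 1 + 5 = 14 bytes, which is consistent with the 14-byte delta1 in the quoted session. With the old index, out[-inscnt - 1], the count would have been written before the start of the buffer instead.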


* Re: More problems...
From: Andreas Gal @ 2005-05-03 15:00 UTC (permalink / raw)
  To: Petr Baudis
  Cc: Linus Torvalds, Anton Altaparmakov, Russell King, Junio C Hamano,
	Ryan Anderson, git
In-Reply-To: <20050503014816.GQ20818@pasky.ji.cz>


I am just soft-linking objects/ in the branched tree. I can live with 
dangling objects, branching is extremely fast, and disk space is cheap 
anyway. The only downside is that it doesn't work too well with rsync as 
network protocol, but I use only http-pull and rpush anyway.

Andreas

On Tue, 3 May 2005, Petr Baudis wrote:

> Dear diary, on Tue, May 03, 2005 at 12:19:16AM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> told me that...
> > But for "normal" situations, where you have a tree or two, the hardlinking 
> > win might not be big enough to warrant the maintenance headache. With 
> > hardlinking, you _do_ need to "trust" the other trees to some degree.
> 
> As long as the trees aren't yours and you aren't doing something really
> horrible with them...
> 
> $ time git-local-pull -a -l $(cat ~/git-devel/.git/HEAD) ~/git-devel/.git/
> real    0m0.332s
> 
> $ time git-local-pull -a $(cat ~/git-devel/.git/HEAD) ~/git-devel/.git/
> real    0m4.306s
> 
> And this is only 13M Cogito objects database. I think one of the
> important things is to encourage branching, therefore it must be fast
> enough; that's why I really wanted to do hardlinks. The disk space is
> important, but the speed hit probably equally (if not more) so.
> 
> BTW, the object database files should have 0444 or such; they really
> _are_ read-only and making them so mode-wise could help against some
> mistakes too.
> 
> It's clear that Cogito should have a way to choose whether to hardlink
> or copy; the question is which one should be the default one and how
> should it be specified.  I thought about using file:// vs. just local
> path to differentiate between copy and hardlinking, but that'd be
> totally non-obvious, therefore bad UI-wise.
> 
> BTW, I've just committed support for pulling from remote repositories
> over the HTTP and SSH protocols (http://your.git/repo,
> git+ssh://root@git.nasa.gov/srv/git/mars) (note that I was unable to
> test the SSH stuff properly now; success reports or patches welcome).
> Also, the local hardlinking access is now done over git-local-pull,
> therefore the cp errors should go away now.
> 
> I'm not yet decided whether locations like
> 
> 	kernel.org:/pub/scm/cogito/cogito.git
> 
> should invoke rsync, rpull, throw an error or print a fortune cookie.
> 
> -- 
> 				Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
> 


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Linus Torvalds @ 2005-05-03 14:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Alon Ziv, Git Mailing List
In-Reply-To: <Pine.LNX.4.62.0505030344170.14033@localhost.localdomain>

On Tue, 3 May 2005, Nicolas Pitre wrote:
> 
> Yep, that's what I've done last weekend (and just made it actually 
> work since people are getting interested).

I have to say that it looks uncommonly simple. Also, afaik, this should
still work with the current fsck, it's just that because fsck doesn't
understand the linkages, the error reporting won't be as good as it could
be (I'd _much_ rather see "delta failed in object xxxxx" than "unable to
read xxxxxx").

Now, one thing I like about this approach is that the actual delta 
_generation_ can be done off-line, and independently of anything else. 
Which means that the performance paths I care about (commit etc) are 
largely unaffected, and you can "deltify" a git archive overnight or 
something. 

In fact, it means that you might even be able to use some fairly expensive 
"search for the best blob object to delta against", very much including an 
intelligent rename search (ie "oh, this is a new object, let's see if any 
of the old deleted objects generate a good delta"), but you might even go 
back more than one generation.

Hmm. How nasty are those scripts?

		Linus


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 14:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505031022340.14033@localhost.localdomain>

On Tuesday 03 May 2005 10:24, Nicolas Pitre wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > Hmmm, something is strange here, am I using this wrong?
> >
> > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> > Aborted
>
> Can you send me your foo and delta2 files?
>
Sorry, thought I had the whole command history in there.  I went for something 
small to start ;)

coffee:~/git/linus.orig # echo foo > foo
coffee:~/git/linus.orig # echo foo2 > foo2
coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
coffee:~/git/linus.orig # ls -la delta1
-rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
*** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***

-chris


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 14:24 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505031013.57476.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> > On Mon, 2 May 2005, Linus Torvalds wrote:
> > > If you do something like this, you want such a delta-blob to be named by
> > > the sha1 of the result, so that things that refer to it can transparently
> > > see either the original blob _or_ the "deltified" one, and will never
> > > care.
> >
> > Yep, that's what I've done last weekend (and just made it actually
> > work since people are getting interested).
> 
> Hmmm, something is strange here, am I using this wrong?
> 
> coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> Aborted

Can you send me your foo and delta2 files?


Nicolas


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Chris Mason @ 2005-05-03 14:13 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <Pine.LNX.4.62.0505030344170.14033@localhost.localdomain>

On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> On Mon, 2 May 2005, Linus Torvalds wrote:
> > If you do something like this, you want such a delta-blob to be named by
> > the sha1 of the result, so that things that refer to it can transparently
> > see either the original blob _or_ the "deltified" one, and will never
> > care.
>
> Yep, that's what I've done last weekend (and just made it actually
> work since people are getting interested).

Hmmm, something is strange here, am I using this wrong?

coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
*** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
Aborted

Valgrind output:

==9634== Invalid read of size 1
==9634==    at 0x1B9036F0: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906F is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036F3: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08D is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B9036F6: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906E is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036F9: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08C is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B9036FC: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906D is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036FF: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08B is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B903702: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906C is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B903708: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08A is not stack'd, malloc'd or (recently) free'd
delta operation failed (returned NULL)
==9634==
==9634== ERROR SUMMARY: 206 errors from 13 contexts (suppressed: 0 from 0)
==9634== malloc/free: in use at exit: 0 bytes in 0 blocks.
==9634== malloc/free: 1 allocs, 1 frees, 5 bytes allocated.
==9634== For a detailed leak analysis,  rerun with: --leak-check=yes
==9634== For counts of detected errors, rerun with: -v

-chris


* Re: [PATCH] add the ability to create and retrieve delta objects
From: Nicolas Pitre @ 2005-05-03 12:51 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git
In-Reply-To: <200505030724.57827.mason@suse.com>

On Tue, 3 May 2005, Chris Mason wrote:

> This looks much nicer than using zdelta, I'll try switching my packed item to 
> your delta generator later this week.  Some quick and dirty space numbers to 
> show why we need to pack the files together:
> 
> On the full import of all the bk->cvs changesets, the average file size 
> in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.
> 
> This means that of the 2.5GB the .git directory consumes, about 1GB is taken 
> up by files under 4k where deltas won't save space.  If the remaining files 
> could be delta compressed down to less than 4k, they would still take up 
> around 400MB on disk.

Sure.  However it helps for history backups and network transfer.

However if the delta compression and packed storage can remain as 
decoupled as possible from each other this is good for flexibility.
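
[Editor's note] The quoted space estimate can be reproduced with a little block arithmetic. The sketch assumes things the mail does not state outright: a 4096-byte filesystem block, every file of 4096 bytes or less occupying exactly one block, and 1GB meaning 2^30 bytes. The function name large_file_floor is hypothetical.

```c
#include <assert.h>

/*
 * Given the space taken by single-block files and the percentage of all
 * files they represent, return the least space the remaining files can
 * occupy if each of them shrinks to a single block.
 */
static unsigned long long large_file_floor(unsigned long long small_bytes,
					   unsigned int pct_small,
					   unsigned long long block)
{
	unsigned long long small_files, large_files;

	assert(block > 0 && pct_small > 0 && pct_small <= 100);
	small_files = small_bytes / block;	/* one block per small file */
	large_files = small_files * (100 - pct_small) / pct_small;
	return large_files * block;
}
```

With 1GB of small files making up 73% of the objects, that is 262144 small files, about 96957 larger ones, and a floor of 96957 * 4096 = 397135872 bytes, which matches the "around 400MB" figure.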


Nicolas

