Git development


* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Nicolas Pitre @ 2006-04-09 17:14 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060409143117.GA23908@erlang.gbar.dtu.dk>

On Sun, 9 Apr 2006, Peter Eriksen wrote:

> Greetings Gitlings,
> 
> I've been trying to implement an alternative algorithm
> for diff_delta().  I'm getting close to something that
> works, but now I'm stuck!  I think it has something to
> do with pack-objects.c, but I'm not sure.  Here's the
> first test that fails:
> 
> *** t5500-fetch-pack.sh ***
> * FAIL 1: 1st pull
>         git-fetch-pack -v .. B A > log.txt 2>&1
> * FAIL 2: fsck
>         git-fsck-objects --full > fsck.txt 2>&1
> * FAIL 3: new object count after 1st pull
>         test 33 = 0
> * FAIL 4: minimal count
>         test 33 = 0
> * FAIL 5: repack && prune-packed in client
>         (git-repack && git-prune-packed)2>>log.txt
> *   ok 5: 2nd pull
> *   ok 6: fsck
> * FAIL 7: new object count after 2nd pull
>         test 192 = 198
> * FAIL 8: minimal count
>         test 192 = 198
> * FAIL 9: repack && prune-packed in client
>         (git-repack && git-prune-packed)2>>log.txt
> *   ok 9: 3rd pull
> *   ok 10: fsck
> * FAIL 11: new object count after 3rd pull
>         test 3 = 228
> * FAIL 12: minimal count
>         test 3 = 30
> * failed 8 among 12 test(s)
> 
> I've been looking all around the current diff_delta(), and I
> can't see, what I'm missing.  Any ideas?  The file is meant to
> replace the current diff-delta.c.

Nothing outside diff-delta.c and patch-delta.c is aware of the delta 
data format.  So if your version is meant to be a transparent 
replacement then it should pass all tests.  If it doesn't then it is 
broken.

To help you play around you could try the test-delta utility (make 
test-delta to build it).

So:

	test-delta -d file1 file2 delta_file
	test-delta -p file1 delta_file file3
	cmp file2 file3

You should always have file3 identical to file2.


Nicolas

^ permalink raw reply

* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Peter Eriksen @ 2006-04-09 17:34 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.64.0604091307460.2215@localhost.localdomain>

On Sun, Apr 09, 2006 at 01:14:31PM -0400, Nicolas Pitre wrote:
...
> Nothing outside diff-delta.c and patch-delta.c is aware of the delta 
> data format.  So if your version is meant to be a transparent 
> replacement then it should pass all tests.  If it doesn't then it is 
> broken.
> 
> To help you play around you could try the test-delta utility (make 
> test-delta to build it).
> 
> So:
> 
> 	test-delta -d file1 file2 delta_file
> 	test-delta -p file1 delta_file file3
> 	cmp file2 file3

My tests of this kind don't show any errors.  Though, if file2 is
empty, test-delta writes: "file2: Invalid argument".

Peter

^ permalink raw reply

* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Nicolas Pitre @ 2006-04-09 17:40 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604091307460.2215@localhost.localdomain>

On Sun, 9 Apr 2006, Nicolas Pitre wrote:

> On Sun, 9 Apr 2006, Peter Eriksen wrote:
> 
> > Greetings Gitlings,
> > 
> > I've been trying to implement an alternative algorithm
> > for diff_delta().  I'm getting close to something that
> > works, but now I'm stuck!
> 
> Nothing outside diff-delta.c and patch-delta.c is aware of the delta 
> data format.  So if your version is meant to be a transparent 
> replacement then it should pass all tests.  If it doesn't then it is 
> broken.
> 
> To help you play around you could try the test-delta utility (make 
> test-delta to build it).
> 
> So:
> 
> 	test-delta -d file1 file2 delta_file
> 	test-delta -p file1 delta_file file3
> 	cmp file2 file3
> 
> You should always have file3 identical to file2.

Out of curiosity I just tried your diff-delta version with test-delta 
and it produced a segmentation fault on the first attempt.

It also has lots of compilation warnings.


Nicolas

^ permalink raw reply

* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Nicolas Pitre @ 2006-04-09 17:45 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060409173409.GB23908@erlang.gbar.dtu.dk>

On Sun, 9 Apr 2006, Peter Eriksen wrote:

> On Sun, Apr 09, 2006 at 01:14:31PM -0400, Nicolas Pitre wrote:
> ...
> > Nothing outside diff-delta.c and patch-delta.c is aware of the delta 
> > data format.  So if your version is meant to be a transparent 
> > replacement then it should pass all tests.  If it doesn't then it is 
> > broken.
> > 
> > To help you play around you could try the test-delta utility (make 
> > test-delta to build it).
> > 
> > So:
> > 
> > 	test-delta -d file1 file2 delta_file
> > 	test-delta -p file1 delta_file file3
> > 	cmp file2 file3
> 
> My tests of this kind don't show any errors. 

Try this with the README file from the git source tree:

	sed s/git/GIT/g < ./README > /tmp/README.mod
	test-delta -d ./README /tmp/README.mod /tmp/README.delta
	[BOOM!]

> Though, if file2 is empty, test-delta writes: "file2: Invalid 
> argument".

We never delta against or towards empty files.


Nicolas

^ permalink raw reply

* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Peter Eriksen @ 2006-04-09 17:53 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.64.0604091333140.2215@localhost.localdomain>

On Sun, Apr 09, 2006 at 01:40:14PM -0400, Nicolas Pitre wrote:
...
> Out of curiosity I just tried your diff-delta version with test-delta 
> and it produced a segmentation fault on the first attempt.

Yes, I get that too with your README example.

> It also has lots of compilation warnings.

Hm, I don't get any warnings.  Would you mind pasting them, so I
can see what it's about?

At least now I have one segmentation fault to work on.  
Thanks.

Peter

^ permalink raw reply

* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Nicolas Pitre @ 2006-04-09 18:08 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060409175316.GA21455@erlang.gbar.dtu.dk>

On Sun, 9 Apr 2006, Peter Eriksen wrote:

> On Sun, Apr 09, 2006 at 01:40:14PM -0400, Nicolas Pitre wrote:
> ...
> > It also has lots of compilation warnings.
> 
> Hm, I don't get any warnings.  Would you mind pasting them, so I
> can see what it's about?

gcc -o diff-delta.o -c -g -O2 -Wall -DSHA1_HEADER='<openssl/sha.h>'  diff-delta.c
diff-delta.c: In function 'diff_delta':
diff-delta.c:123: warning: pointer targets in passing argument 1 of 'init_hash' differ in signedness
diff-delta.c:124: warning: pointer targets in passing argument 1 of 'init_hash' differ in signedness
diff-delta.c:170: warning: pointer targets in passing argument 1 of 'hash' differ in signedness
diff-delta.c:171: warning: pointer targets in passing argument 1 of 'hash' differ in signedness
diff-delta.c:203: warning: pointer targets in passing argument 1 of 'init_hash' differ in signedness
diff-delta.c:204: warning: pointer targets in passing argument 1 of 'init_hash' differ in signedness

Also you should avoid declaring new variables after code in the same 
scope, like you do with version_offset for example.  This is a C99 
feature that many C compilers don't support.


Nicolas

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Junio C Hamano @ 2006-04-09 18:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604090950590.9504@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> I wonder... This all looks fine, but there are actually two different 
> "diffs" that can be shown for "git log --diff <pathlimiter>":
>
>  - the whole diff for a commit
>  - the path-limited diff

Yes, exactly the same way sometimes you would want just pickaxe,
sometimes you would want it with --pickaxe-all.

Also, I might have to rethink --max-count logic -- I think it is
reasonable to skip the commit when doing limiting by diff like
"whatchanged" does, but one thing I find suboptimal with the
current whatchanged is that it does not count commits that are
actually shown (it counts what the upstream rev-list feeds
diff-tree).  With the "git log --diff" based whatchanged, it
becomes trivial to skip the revs->max_count limiting and have
the caller count the commits it actually does something
user-visible to, instead of counting the commits it pulled out
of get_revision().
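
As a rough illustration of counting in the caller, the sketch below
applies the limit to commits that are actually shown rather than to
commits pulled from the walk.  Everything in it (struct toy_commit,
show_limited(), the has_diff flag) is a made-up stand-in, not git's
revision machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a commit: only records whether its diff is non-empty. */
struct toy_commit {
	int id;
	int has_diff;
};

/*
 * Walk list in order and "show" only commits with a non-empty diff,
 * stopping once max_count commits have been shown (max_count < 0
 * means no limit).  The ids shown are stored in shown[]; the return
 * value is how many were shown.
 */
static size_t show_limited(const struct toy_commit *list, size_t n,
			   int max_count, int *shown)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (max_count >= 0 && count >= (size_t)max_count)
			break;
		if (!list[i].has_diff)
			continue;	/* skipped commits do not use up the limit */
		shown[count++] = list[i].id;
	}
	return count;
}
```

With five commits of which the second, fourth and fifth have diffs, a
limit of 2 shows the second and fourth; counting pulled commits
instead would have shown only the second.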

BTW I think I could remove the log message generation part of
"git log" and have it use the one in log-tree (which I will
probably rewrite not to format the message into the static
this_header[] buffer when it is not shown).

Another thing that might be useful is to teach diff-* to do the
diffstat part internally.  After that is in place we could
introduce --pretty=patch to have "git log" produce format-patch
compatible output.

^ permalink raw reply

* git ident
From: Jeremy English @ 2006-04-09 18:48 UTC (permalink / raw)
  To: git

I keep a local project in a git archive.  After the last upgrade I get an 
ident error when trying to commit.  It works after I set the environment 
variables.  What I don't like is that the error comes up after I have 
typed in my comment, then my comment is lost, that's frustrating.  The 
other thing is I don't care if the commit is coming from a valid person, 
why require this?

^ permalink raw reply

* Re: git ident
From: Junio C Hamano @ 2006-04-09 19:01 UTC (permalink / raw)
  To: Jeremy English; +Cc: git
In-Reply-To: <44395711.7000902@jeremyenglish.org>

Jeremy English <jhe@jeremyenglish.org> writes:

> What I don't like is that the error comes up
> after I have typed in my comment, then my comment is lost, that's
> frustrating.

Understandable, but presumably a new user needs to be burned only
once (set your identity either in $HOME/.profile or .git/config if you
want to use a separate identity per project).

> ....  The other thing is I don't care if the commit is coming
> from a valid person, why require this?

Because public projects like the kernel want to prevent
otherwise good commits from a misconfigured repository from
propagating into them.  We could have a separate per-repository
configuration to say "broken identity is not a problem for this
project", but if the user has to set that in the configuration,
she would be better off setting her identity there.

And making it the default not to require the identity is going
backwards. Our primary focus is to support public, multi-person,
distributed development projects.

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Linus Torvalds @ 2006-04-09 19:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vbqvabn8f.fsf@assigned-by-dhcp.cox.net>

On Sun, 9 Apr 2006, Junio C Hamano wrote:
> 
> Also, I might have to rethink --max-count logic -- I think it is
> reasonable to skip the commit when doing limiting by diff like
> "whatchanged" does, but one thing I find suboptimal with the
> current whatchanged is that it does not count commits that are
> actually shown (it counts what the upstream rev-list feeds
> diff-tree).  With the "git log --diff" based whatchanged, it
> becomes trivial to skip the revs->max_count limiting and have
> the caller count the commits it actually does something
> user-visible to, instead of counting the commits it pulled out
> of get_revision().

Well, on the other hand, the new "git log --diff" should get the revision 
counting right even if it's _not_ done by the caller.

Really, the only reason "git-whatchanged" exists at all is that it used to 
be originally impossible, and later on too expensive to do the commit- 
limiting by pathname. With the new incremental path-limiting, the reason 
for "git-whatchanged" simply goes away.

So I'd suggest:
 - drop git-whatchanged entirely
 - keep it - for historical reasons - as an internal shorthand, and just 
   turn it into "git log --diff -cc"

and everybody will be happy (yeah, it will show a few merge commits 
without diffs, because the diffs end up being uninteresting, but that's 
_fine_, even if it's not 100% the same thing git-whatchanged used to do)

			Linus

^ permalink raw reply

* Re: git ident
From: sean @ 2006-04-09 19:02 UTC (permalink / raw)
  To: Jeremy English; +Cc: git
In-Reply-To: <44395711.7000902@jeremyenglish.org>

On Sun, 09 Apr 2006 13:48:49 -0500
Jeremy English <jhe@jeremyenglish.org> wrote:

> I keep a local project in a git archive.  After the last upgrade I get an 
> ident error when trying to commit.  It works after I set the environment 
> variables.  What I don't like is that the error comes up after I have 
> typed in my comment, then my comment is lost, that's frustrating.  The 
> other thing is I don't care if the commit is coming from a valid person, 
> why require this?

I believe it is required to reduce the number of commits made in the 
kernel project with incorrect attribution.  To remove the need to
set environment variables, use the repo-config command to set some
defaults:

$ git repo-config user.email "you@email.com"
$ git repo-config user.name "your name"

HTH,
Sean

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Junio C Hamano @ 2006-04-09 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604091158310.9504@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Well, on the other hand, the new "git log --diff" should get the revision 
> counting right even if it's _not_ done by the caller.

Not if the user uses --diff-filter and/or --pickaxe, and after
we start omitting the log message part when no diff is output.

> So I'd suggest:
>  - drop git-whatchanged entirely
>  - keep it - for historical reasons - as an internal shorthand, and just 
>    turn it into "git log --diff -cc"
>
> and everybody will be happy (yeah, it will show a few merge commits 
> without diffs, because the diffs end up being uninteresting, but that's 
> _fine_, even if it's not 100% the same thing git-whatchanged used to do)

I tend to agree.  A merge commit touching a path but not
actually changing the contents of the path from parents might be
a significant event.

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Linus Torvalds @ 2006-04-09 19:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v3bgmbm8b.fsf@assigned-by-dhcp.cox.net>

On Sun, 9 Apr 2006, Junio C Hamano wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > Well, on the other hand, the new "git log --diff" should get the revision 
> > counting right even if it's _not_ done by the caller.
> 
> Not if the user uses --diff-filter and/or --pickaxe, and after
> we start omitting the log message part when no diff is output.

Fair enough. At that point the counting does have to be done in the 
caller, I guess.

> A merge commit touching a path but not actually changing the contents of 
> the path from parents might be a significant event.

Yes. The fact that git-whatchanged happens to ignore such things right now 
is just an implementation detail, not a "good thing". The new git log seems 
to be better in pretty much all respects.

The bigger conceptual difference is actually that once you do revision 
pruning based on the pathname limiter, we prune away parents of merges 
that seem "uninteresting". So before, when you had the same change come 
through two different branches, "git-whatchanged" would actually show it 
twice, while the new "git log" approach would tend to show it just once 
(because it would pick one of the histories and ignore the other).

I think that's fine (and probably even preferable), but it's another 
example of something where we might want to have an option to not 
simplify the merge history. It's likely that nobody will ever care, but 
who knows..

			Linus

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Johannes Schindelin @ 2006-04-09 21:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0604091158310.9504@g5.osdl.org>

Hi,

On Sun, 9 Apr 2006, Linus Torvalds wrote:

>  - keep it - for historical reasons - as an internal shorthand, and just 
>    turn it into "git log --diff -cc"

It is "git log --cc", right? And BTW, I was burnt by the difference of 
"git-log" and "git log" this time. "git-log" does not understand "--cc". 
Could we kill "git-log", please?

Ciao,
Dscho

^ permalink raw reply

* Re: [ANNOUNCE] git-svnconvert: YASI (Yet Another SVN importer)
From: Rutger Nijlunsing @ 2006-04-09 21:15 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
In-Reply-To: <e1bdjq$qf6$1@sea.gmane.org>

On Sun, Apr 09, 2006 at 06:43:53PM +0200, Jakub Narebski wrote:
> Rutger Nijlunsing wrote:
> 
> > Since I didn't succeed in importing a (private) SVN repo into git, I
> > wrote a new converter to handle more cases like:
> 
> Both git-svn[*1*] and git-svnimport failed? Have you tried Tailor tool:
>   http://www.darcs.net/DarcsWiki/Tailor

git-svn and tailor can only track one branch (or trunk). As the
git-svn page states, it is for contributing to such a branch /
trunk. git-svnconvert is for incrementally converting a whole
repository, whose branches are (IMHO) important to keep and
convert.

git-svnimport does handle multiple branches, but could not cope with
proxy + repo authentication, the weird repo layout I've had to cope
with (branches not only in /branches, several trunks) and some
revisions which contain nonsensical actions.

> >   - use command line svn instead of some perl library to have fewer
> >     dependencies and to support proxy + repo authentication.
> >     Might even work on MacOSX ;)
> 
> Instead adding dependence on Ruby, eh?

Take some, lose some ;)

Seriously, though, a dependency on a mainstream language like
Python/Perl/Ruby/.. isn't a problem since a package is available for
all distributions. However, packages for mainstream languages are
quite often out-of-date or are not supported at all. Seeing a program
being dependent on a non-packaged module is enough for a truckload of
people to not even try it.

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

^ permalink raw reply

* Re: [ANNOUNCE] git-svnconvert: YASI (Yet Another SVN importer)
From: Johannes Schindelin @ 2006-04-09 21:30 UTC (permalink / raw)
  To: git; +Cc: Jakub Narebski, git
In-Reply-To: <20060409211505.GA30567@nospam.com>

Hi,

On Sun, 9 Apr 2006, Rutger Nijlunsing wrote:

> On Sun, Apr 09, 2006 at 06:43:53PM +0200, Jakub Narebski wrote:
> > 
> > Instead adding dependence on Ruby, eh?
> 
> Take some, lose some ;)
> 
> Seriously, though, a dependency on a mainstream language like
> Python/Perl/Ruby/.. isn't a problem since a package is available for
> all distributions. However, packages for mainstream languages are
> quite often out-of-date or are not supported at all. Seeing a program
> being dependent on a non-packaged module is enough for a truckload of
> people to not even try it.

I have _never_ seen a setup where Ruby was installed by default. Perl 
always, Python often.

Furthermore, my feeling is that we are in the beginning phase of migration 
from scripting languages (which are good for prototyping) towards plain C. 
So adding yet another scripting language dependency is a little backwards.

Ciao,
Dscho

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Johannes Schindelin @ 2006-04-09 22:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.63.0604092312340.29136@wbgn013.biozentrum.uni-wuerzburg.de>

Hi,

On Sun, 9 Apr 2006, Johannes Schindelin wrote:

> On Sun, 9 Apr 2006, Linus Torvalds wrote:
> 
> >  - keep it - for historical reasons - as an internal shorthand, and just 
> >    turn it into "git log --diff -cc"
> 
> It is "git log --cc", right?

Like this?

---

 git.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

751e205a9ffd3a55094a0c0f657735023776cf74
diff --git a/git.c b/git.c
index 8776088..3a94afa 100644
--- a/git.c
+++ b/git.c
@@ -385,6 +385,13 @@ static int cmd_log(int argc, const char 
 	return 0;
 }
 
+static int cmd_whatchanged(int argc, const char **argv, char **envp)
+{
+	memmove(argv + 2, argv + 1, argc - 1);
+	argv[1] = "--cc";
+	return cmd_log(argc + 1, argv, envp);
+}
+
 static void handle_internal_command(int argc, const char **argv, char **envp)
 {
 	const char *cmd = argv[0];
@@ -395,6 +402,7 @@ static void handle_internal_command(int 
 		{ "version", cmd_version },
 		{ "help", cmd_help },
 		{ "log", cmd_log },
+		{ "whatchanged", cmd_whatchanged },
 	};
 	int i;
 
-- 
1.2.0.g61002-dirty

^ permalink raw reply related

* Re: [ANNOUNCE] git-svnconvert: YASI (Yet Another SVN importer)
From: Randal L. Schwartz @ 2006-04-09 22:06 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jakub Narebski, git
In-Reply-To: <Pine.LNX.4.63.0604092325590.29434@wbgn013.biozentrum.uni-wuerzburg.de>

>>>>> "Johannes" == Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

Johannes> I have _never_ seen a setup where Ruby was installed by
Johannes> default. Perl always, Python often.

OSX includes ruby by default.

Johannes> Furthermore, my feeling is that we are in the beginning phase of
Johannes> migration from scripting languages (which are good for prototyping)
Johannes> towards plain C.  So adding yet another scripting language
Johannes> dependency is a little backwards.

You seem a bit prejudiced here.  Are there performance problems in
the Perl and Python parts of git?  If so, concentrate first on optimizing
the code where it matters.  Then create bindings to the "git lib"
so that the heavy lifting can be done in C while still allowing
the basic algorithms to be written in a higher level language.

It would be a step *backwards* to recode all of git in C.

Now, the *shell* parts, on the other hand, are screaming for a rewrite into
Perl or Python.  fork-fork-fork and worrying about escaping special characters
needlessly burns a lot of cpu and programmer time.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Timo Hirvonen @ 2006-04-09 22:22 UTC (permalink / raw)
  To: git; +Cc: torvalds, junkio, git
In-Reply-To: <Pine.LNX.4.63.0604100000430.30000@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:

> +static int cmd_whatchanged(int argc, const char **argv, char **envp)
> +{
> +	memmove(argv + 2, argv + 1, argc - 1);

Shouldn't the size be sizeof(char *) * argc (NULL terminated array)?
There's also overflow...
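
To spell out both problems: memmove() takes a size in bytes, so
"argc - 1" moves only a few bytes rather than argc - 1 pointers, and
shifting the entries up by one slot inside the original array writes
one element past its terminating NULL.  One way around both, shown
here as a sketch with a hypothetical helper name (argv_with_option()
is not part of git.c), is to build a new array that has room for the
extra option.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated, NULL-terminated copy of argv with opt
 * inserted at index 1.  The caller frees the result.  Returns NULL
 * on allocation failure.  Hypothetical helper, not part of git.c.
 */
static const char **argv_with_option(int argc, const char **argv,
				     const char *opt)
{
	const char **out;

	assert(argc >= 1);
	/* argc entries + 1 inserted option + 1 terminating NULL */
	out = malloc(sizeof(*out) * ((size_t)argc + 2));
	if (!out)
		return NULL;
	out[0] = argv[0];
	out[1] = opt;
	/* Sizes are in bytes, so scale the element count by sizeof(*argv). */
	memcpy(out + 2, argv + 1, sizeof(*argv) * ((size_t)argc - 1));
	out[argc + 1] = NULL;
	return out;
}
```

A caller would then pass the new array and argc + 1 on to the
command, and free the array afterwards.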

-- 
http://onion.dynserv.net/~timo/

^ permalink raw reply

* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Peter Eriksen @ 2006-04-09 22:45 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.64.0604091340540.2215@localhost.localdomain>

On Sun, Apr 09, 2006 at 01:45:00PM -0400, Nicolas Pitre wrote:
...
> Try this with the README file from the git source tree:
> 
> 	sed s/git/GIT/g < ./README > /tmp/README.mod
> 	test-delta -d ./README /tmp/README.mod /tmp/README.delta
> 	[BOOM!]

I found the bug.  The code still has some limitations, but now
it passes the test suite.  Thanks for your help, Nicolas.

Peter

----->8---diff-delta.c---->8-------
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "delta.h"


#define BASE 257
#define PREFIX_SIZE 3

#define SIZE 10
#define HASH_TABLE_SIZE (1<<SIZE)

#define DELTA_SIZE (1024 * 1024)


unsigned int init_hash(unsigned char* data) {
  return data[0]*BASE*BASE + data[1]*BASE + data[2];
}

unsigned int hash(unsigned char* data, unsigned int hash) {
  return (hash - data[-1]*BASE*BASE)*BASE + data[2];
}

#define GR_PRIME 0x9e370001
#define HASH(v) ((v * GR_PRIME) >> (32 - SIZE))

struct entry {
  char file;
  char* offset;
};


void flush(struct entry* table) {
  memset(table, 0, HASH_TABLE_SIZE * sizeof(struct entry));
}


int same_prefixes(char* data1, char* data2) {
  return !memcmp(data1, data2, PREFIX_SIZE);  
}


void encode_add(char* out, int* outpos, char* version_start, char* version_copy) {
  unsigned int size = version_copy - version_start;
  if (!size) return;
  int pos = *outpos;

  while(size > 127) {
    out[pos++] = 127;
    memcpy(out + pos, version_start, 127);
    pos += 127;
    version_start += 127;
    size -= 127;
  }
  out[pos++] = size;
  memcpy(out + pos, version_start, size);  
  pos += size;

  *outpos = pos;
}


void encode_copy(char* out, int* outpos, int offset, int size) {
     int pos = (*outpos) + 1;
     int i = 0x80;

     if (offset & 0xff) { out[pos++] = offset; i |= 0x01; }
     offset >>= 8;
     if (offset & 0xff) { out[pos++] = offset; i |= 0x02; }
     offset >>= 8;
     if (offset & 0xff) { out[pos++] = offset; i |= 0x04; }
     offset >>= 8;
     if (offset & 0xff) { out[pos++] = offset; i |= 0x08; }

     if (size & 0xff) { out[pos++] = size; i |= 0x10; }
     size >>= 8;
     if (size & 0xff) { out[pos++] = size; i |= 0x20; }

     out[*outpos] = i;
     *outpos = pos;
}



void encode_size(char* out, int* outpos, unsigned long size) {
  int pos = *outpos;
  out[pos] = size;
  size >>= 7;
  while (size) {
    out[pos++] |= 0x80;
    out[pos] = size;
    size >>= 7;
  }
  *outpos = ++pos;
}




void *diff_delta(void *from_buf, unsigned long from_size,
		 void *to_buf, unsigned long to_size,
		 unsigned long *delta_size,
		 unsigned long max_size) {
  unsigned int index;
  unsigned int l;
  unsigned char* base = from_buf;
  unsigned char* version = to_buf;
  unsigned long base_size = from_size;
  unsigned long version_size = to_size;

  unsigned char* base_copy = base;
  unsigned char* version_copy = version;
  struct entry* table = calloc(HASH_TABLE_SIZE, sizeof(struct entry));
  //int delta_alloc = DELTA_SIZE;
  unsigned char* delta = malloc(DELTA_SIZE);
  unsigned int deltapos = 0;
  unsigned char* base_top = base + base_size;
  unsigned char* version_top = version + version_size;

  encode_size(delta, &deltapos, base_size);
  encode_size(delta, &deltapos, version_size);

  unsigned char* base_offset = base;
  unsigned char* version_offset = version;
  unsigned int base_hash = init_hash(base);
  unsigned int version_hash = init_hash(version);
  unsigned char* version_start = version;

  while(base_offset - base + PREFIX_SIZE < base_top - base && 
	version_offset - version + PREFIX_SIZE < version_top - version) {  
    // step2:

    index = HASH(base_hash);
    switch (table[index].file) {
    case '\0': {
      table[index].file = 'b';
      table[index].offset = base_offset;
      break;
    }
    case 'v': {
      if (same_prefixes(base_offset, table[index].offset)) {
	base_copy = base_offset;
	version_copy = table[index].offset;
	goto step3;
      } else break;
    }
    case 'b': break;
    default: printf("AAAAAARGH 2b\n");
    }
    
    index = HASH(version_hash);
    switch (table[index].file) {
    case '\0': {
      table[index].file = 'v';
      table[index].offset = version_offset;
      break;
    }
    case 'b': {
      if (same_prefixes(table[index].offset, version_offset)) {
	base_copy = table[index].offset;
	version_copy = version_offset;
	goto step3;
      } else break;
    }
    case 'v': break;
    default: printf("AAAAAARGH 2v\n");
    }
    
    base_offset++;
    version_offset++;

    base_hash = hash(base_offset, base_hash);
    version_hash = hash(version_offset, version_hash);
    continue;  //  goto step2;
    
  step3:
    l = 0;
    while(base_copy[l] == version_copy[l] && base_copy + l < base_top && version_copy + l < version_top) l++;
    base_offset = base_copy + l;
    version_offset = version_copy + l;
    
    /*
    // Make sure we don't run out of delta buffer when encoding.
    if((delta_alloc - deltapos) < 
       (version_start - version_copy) + 1 + 8 + (PREFIX_SIZE + 1)) {
      delta_alloc = delta_alloc * 3 / 2;
      delta = (char*) realloc(delta, delta_alloc);
    }
    */
	if(max_size && deltapos > max_size) {
		free(delta);
		free(table);
		return NULL;
	}

	//fprintf(stdout, "add: pos %u, v_start %u, v_copy %u\n", 
	//	deltapos, version_start - version, version_copy - version);	


    // step4:
    encode_add(delta, &deltapos, version_start, version_copy);

	//fprintf(stdout, "copy: pos %u, v_copy %u, l %u\n", 
	//	deltapos, base_copy - base, l);	


    encode_copy(delta, &deltapos, base_copy - base, l);
    
    // step5:
    flush(table);
    
    version_start = version_offset;
    
    base_hash = init_hash(base_offset);
    version_hash = init_hash(version_offset);
    
	//fprintf(stdout, "3) pos %u, v_start %u, v %u, b %u\n", 
	//	deltapos, version_start - version, version_offset - version, base_offset- base);	
  }  //  goto step2;
  
	//fprintf(stdout, "pos %u, v_start %u, v_top %u\n", 
	//	deltapos, version_start - version, version_size);	
  encode_add(delta, &deltapos, version_start, version + version_size);
  *delta_size = deltapos;
  free(table);
  return delta;
}
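
The two sizes that encode_size() writes at the start of the delta use
7 bits per byte, least significant group first, with the high bit
(0x80) set on every byte except the last.  The sketch below is a
round trip of that encoding under those assumptions; put_size() and
get_size() are hypothetical names, and the reader in patch-delta.c
may well look different.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write size at out[*pos] as base-128 digits, least significant
 * first; every byte except the last has its high bit set.
 * Advances *pos past the bytes written.
 */
static void put_size(unsigned char *out, size_t *pos, unsigned long size)
{
	size_t p = *pos;

	while (size >= 0x80) {
		out[p++] = (unsigned char)((size & 0x7f) | 0x80);
		size >>= 7;
	}
	out[p++] = (unsigned char)size;
	*pos = p;
}

/* Read back a size written by put_size(), advancing *pos. */
static unsigned long get_size(const unsigned char *in, size_t *pos)
{
	unsigned long size = 0;
	unsigned int shift = 0;
	unsigned char c;

	do {
		c = in[(*pos)++];
		size |= (unsigned long)(c & 0x7f) << shift;
		shift += 7;
	} while (c & 0x80);
	return size;
}
```

For example, a size of 300 comes out as the two bytes 0xac 0x02.
Unlike the posted encode_size(), this version masks each byte to 7
bits explicitly instead of relying on the later "|= 0x80" to cover
bit 7.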

^ permalink raw reply

* Re: Fixes to parsecvs
From: Francois Romieu @ 2006-04-09 23:17 UTC (permalink / raw)
  To: Keith Packard; +Cc: Jan-Benedict Glaw, Git Mailing List
In-Reply-To: <1144334896.2303.259.camel@neko.keithp.com>

Keith Packard <keithp@keithp.com> :
[...]
> > How well does this work with even larger repositories?
> 
> postgresql is the largest I've run; starting with a 615M CVS repository,
> it built a 1.7G .git tree, which packed down to 125M.

As a datapoint, I gave parsecvs a try on a local CVS repository.
The repository weighs 3.28 GB. It contains 53k files (45k non-attic).

.git/objects grew from ~100k files at the end of the first pass to
199k files (~11k commits). It took 18h on a 3GHz PIV with 2 GB of RAM.
After 6 hours, 400 MB were pushed to swap and parsecvs took 1.95 GB
of RAM for itself. No significant swap activity. Swap grew to 900 MB
at the end of the run. A tarball (5 MB) containing vmstat + size of objects
is available at http://www.cogenit.fr/linux/misc/cvsparse-debug.tar.bz2

I have interrupted 'git repack -a -d' after 6 hours.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Junio C Hamano @ 2006-04-09 23:51 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, git
In-Reply-To: <Pine.LNX.4.63.0604100000430.30000@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> On Sun, 9 Apr 2006, Johannes Schindelin wrote:
>
>> On Sun, 9 Apr 2006, Linus Torvalds wrote:
>> 
>> >  - keep it - for historical reasons - as an internal shorthand, and just 
>> >    turn it into "git log --diff -cc"
>> 
>> It is "git log --cc", right?
>
> Like this?

I do not think so.  You should default to --cc only when there is no
explicit command line stuff from the user.

^ permalink raw reply

* Re: [PATCH] git log [diff-tree options]...
From: Linus Torvalds @ 2006-04-10  0:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git
In-Reply-To: <7vy7ye9uk8.fsf@assigned-by-dhcp.cox.net>

On Sun, 9 Apr 2006, Junio C Hamano wrote:
> 
> I do not think so.  You should default to --cc only when there is no
> explicit command line stuff from the user.

Actually, even that would be wrong, when I think more about it. The 
default for "git-whatchanged" is to do diffing, but default to the "raw" 
diff (just "-r" for recursive).

So the most appropriate default set of flags is likely "-r -c", which also 
means that any subsequent explicit command line stuff will override it (ie 
adding a "-p" should automatically do the right thing).

But the "memmove()" to move the arguments around was definitely broken. 
Much better to just initialize the diff flags manually, I think.

		Linus
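
The "initialize the defaults, then let explicit options override them"
approach described above can be sketched as follows. This is only an
illustration: struct log_opts, enum diff_format and parse_log_opts() are
hypothetical names, not the actual git code, and only "-p" and "--cc" are
handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical output formats; the names are illustrative, not git's. */
enum diff_format { FMT_RAW, FMT_PATCH };

/* Hypothetical option block standing in for the real diff flags. */
struct log_opts {
	int recursive;		/* -r */
	int combine_merges;	/* -c */
	int dense_combined;	/* --cc */
	enum diff_format format;
};

/* Set the defaults first, then let explicit arguments override them. */
static void parse_log_opts(struct log_opts *o, int argc, const char **argv)
{
	int i;

	assert(o != NULL);

	/* Defaults corresponding to "-r -c" with raw output. */
	o->recursive = 1;
	o->combine_merges = 1;
	o->dense_combined = 0;
	o->format = FMT_RAW;

	/* Anything given explicitly simply overwrites the default. */
	for (i = 0; i < argc; i++) {
		if (!strcmp(argv[i], "-p"))
			o->format = FMT_PATCH;
		else if (!strcmp(argv[i], "--cc"))
			o->dense_combined = 1;
	}
}
```

Because the defaults are plain assignments made before the argument loop,
no argv shuffling is needed and a later "-p" wins without special cases.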


* [PATCH] Implement --fuzz= option for git-apply.
From: Eric W. Biederman @ 2006-04-10  2:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git


Currently, to import the -mm tree I have to work around
git-apply by using patch, because some of Andrew's
patches in quilt will only apply with fuzz.

Allowing git-apply to handle fuzz makes it much easier to import
the -mm tree into git.  I am still only processing about 1.5 patches a
second, which for the 692 patches in 2.6.17-rc1-mm2 is still painful,
but it does help.

If I just apply the patches and don't run git-mailinfo,
git-write-tree, and git-write-commit, I get about 4 patches
per second.

This patch defaults to leaving fuzz processing off, so if you don't
want patches that only apply with fuzz you won't get them.

If a patch does require fuzz to apply you will get a warning:
> Fragment applied at offset: +-#lines (fuzz: #context_lines_deleted)
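
For readers who want the idea in isolation: each fuzz step drops one
leading and one trailing line from the text being searched for and then
retries the match. Below is a minimal standalone sketch of that trimming
step. strip_outer_lines() is a hypothetical helper written for this
illustration with explicit bounds checks; it is not the reduce_context()
function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Drop the first and the last line of the text at *buf (length *size).
 * On success *buf and *size describe the remaining middle part and 0 is
 * returned.  Returns -1 if the text does not contain at least two lines.
 */
static int strip_outer_lines(const char **buf, size_t *size)
{
	const char *p, *nl;
	size_t n, end;

	assert(buf != NULL && size != NULL);
	p = *buf;
	n = *size;

	/* Remove the first line, including its newline. */
	nl = memchr(p, '\n', n);
	if (!nl)
		return -1;
	n -= (size_t)(nl - p) + 1;
	p = nl + 1;
	if (n == 0)
		return -1;

	/* Remove the last line: skip its newline, then cut back to the
	 * newline that ends the previous line (or to the start). */
	end = n;
	if (p[end - 1] == '\n')
		end--;
	while (end > 0 && p[end - 1] != '\n')
		end--;

	*buf = p;
	*size = end;
	return 0;
}
```

Calling it once on "a\nb\nc\n" leaves "b\n"; a caller would repeat the
call up to the permitted fuzz and search for the shrunken text each time.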

diff --git a/apply.c b/apply.c
index 33b4271..a07503f 100644
--- a/apply.c
+++ b/apply.c
@@ -32,8 +32,9 @@ static int apply = 1;
 static int no_add = 0;
 static int show_index_info = 0;
 static int line_termination = '\n';
+static int p_fuzz = 0;
 static const char apply_usage[] =
-"git-apply [--stat] [--numstat] [--summary] [--check] [--index] [--apply] [--no-add] [--index-info] [--allow-binary-replacement] [-z] [-pNUM] [--whitespace=<nowarn|warn|error|error-all|strip>] <patch>...";
+"git-apply [--stat] [--numstat] [--summary] [--check] [--index] [--apply] [--no-add] [--index-info] [--allow-binary-replacement] [-z] [-pNUM] [--fuzz=NUM] [--whitespace=<nowarn|warn|error|error-all|strip>] <patch>...";
 
 static enum whitespace_eol {
 	nowarn_whitespace,
@@ -100,6 +101,7 @@ static int max_change, max_len;
 static int linenr = 1;
 
 struct fragment {
+	unsigned long context;
 	unsigned long oldpos, oldlines;
 	unsigned long newpos, newlines;
 	const char *patch;
@@ -817,12 +819,15 @@ static int parse_fragment(char *line, un
 	int added, deleted;
 	int len = linelen(line, size), offset;
 	unsigned long oldlines, newlines;
+	unsigned long leading, trailing;
 
 	offset = parse_fragment_header(line, len, fragment);
 	if (offset < 0)
 		return -1;
 	oldlines = fragment->oldlines;
 	newlines = fragment->newlines;
+	leading = 0;
+	trailing = 0;
 
 	if (patch->is_new < 0) {
 		patch->is_new =  !oldlines;
@@ -860,10 +865,14 @@ static int parse_fragment(char *line, un
 		case ' ':
 			oldlines--;
 			newlines--;
+			if (!deleted && !added)
+				leading++;
+			trailing++;
 			break;
 		case '-':
 			deleted++;
 			oldlines--;
+			trailing = 0;
 			break;
 		case '+':
 			/*
@@ -887,6 +896,7 @@ static int parse_fragment(char *line, un
 			}
 			added++;
 			newlines--;
+			trailing = 0;
 			break;
 
                 /* We allow "\ No newline at end of file". Depending
@@ -904,6 +914,10 @@ static int parse_fragment(char *line, un
 	}
 	if (oldlines || newlines)
 		return -1;
+	fragment->context = leading;
+	if (leading > trailing)
+		fragment->context = trailing;
+
 	/* If a fragment ends with an incomplete line, we failed to include
 	 * it in the above loop because we hit oldlines == newlines == 0
 	 * before seeing it.
@@ -1087,7 +1101,7 @@ static int read_old_data(struct stat *st
 	}
 }
 
-static int find_offset(const char *buf, unsigned long size, const char *fragment, unsigned long fragsize, int line)
+static int find_offset(const char *buf, unsigned long size, const char *fragment, unsigned long fragsize, int line, int *lines)
 {
 	int i;
 	unsigned long start, backwards, forwards;
@@ -1148,6 +1162,7 @@ static int find_offset(const char *buf, 
 		n = (i >> 1)+1;
 		if (i & 1)
 			n = -n;
+		*lines = n;
 		return try;
 	}
 
@@ -1155,6 +1170,31 @@ static int find_offset(const char *buf, 
 	 * We should start searching forward and backward.
 	 */
 	return -1;
+}
+
+static void reduce_context(char **buf, int *size)
+{
+	char *ctx = *buf;
+	unsigned long ctxsize = *size;
+	unsigned long offset;
+
+	/* Remove the first line */
+	offset = 0;
+	while (offset <= ctxsize) {
+		if (ctx[offset++] == '\n')
+			break;
+	}
+	ctxsize -= offset;
+	ctx += offset;
+	/* Remove the last line */
+	offset = ctxsize - 1;
+	while (offset > 0) {
+		if (ctx[--offset] == '\n')
+			break;
+	}
+	ctxsize = offset + 1;
+	*buf = ctx;
+	*size = ctxsize;
 }
 
 struct buffer_desc {
@@ -1192,7 +1232,10 @@ static int apply_one_fragment(struct buf
 	int offset, size = frag->size;
 	char *old = xmalloc(size);
 	char *new = xmalloc(size);
-	int oldsize = 0, newsize = 0;
+	char *ctx;
+	int oldsize = 0, newsize = 0, ctxsize;
+	int lines;
+	int fuzz, max_fuzz;
 
 	while (size > 0) {
 		int len = linelen(patch, size);
@@ -1241,23 +1284,39 @@ #ifdef NO_ACCURATE_DIFF
 		newsize--;
 	}
 #endif
+
+	offset = -1; /* shutup gcc */
+	ctx = old;
+	ctxsize = oldsize;
+	lines = 0;
+	max_fuzz = (p_fuzz < frag->context) ? p_fuzz : frag->context;
+	for (fuzz = 0; fuzz <= max_fuzz; fuzz++) {
+		/* Reduce the number of context lines */
+		if (fuzz) 
+			reduce_context(&ctx, &ctxsize);
+		offset = find_offset(buf, desc->size, ctx, ctxsize, frag->newpos + fuzz, &lines);
+		if (offset >= 0) {
+			int diff = newsize - ctxsize;
+			unsigned long size = desc->size + diff;
+			unsigned long alloc = desc->alloc;
+
+			if (fuzz)
+				fprintf(stderr, "Fragment applied at offset: %d (fuzz: %d)\n",
+					lines, fuzz);
+
+			if (size > alloc) {
+				alloc = size + 8192;
+				desc->alloc = alloc;
+				buf = xrealloc(buf, alloc);
+				desc->buffer = buf;
+			}
+			desc->size = size;
+			memmove(buf + offset + newsize, buf + offset + ctxsize, size - offset - newsize);
+			memcpy(buf + offset, new, newsize);
+			offset = 0;
 			
-	offset = find_offset(buf, desc->size, old, oldsize, frag->newpos);
-	if (offset >= 0) {
-		int diff = newsize - oldsize;
-		unsigned long size = desc->size + diff;
-		unsigned long alloc = desc->alloc;
-
-		if (size > alloc) {
-			alloc = size + 8192;
-			desc->alloc = alloc;
-			buf = xrealloc(buf, alloc);
-			desc->buffer = buf;
+			break;
 		}
-		desc->size = size;
-		memmove(buf + offset + newsize, buf + offset + oldsize, size - offset - newsize);
-		memcpy(buf + offset, new, newsize);
-		offset = 0;
 	}
 
 	free(old);
@@ -1943,6 +2002,10 @@ int main(int argc, char **argv)
 		}
 		if (!strcmp(arg, "-z")) {
 			line_termination = 0;
+			continue;
+		}
+		if (!strncmp(arg, "--fuzz=", 7)) {
+			p_fuzz = atoi(arg + 7);
 			continue;
 		}
 		if (!strncmp(arg, "--whitespace=", 13)) {


* Re: [RFH] Exploration of an alternative diff_delta() algorithm
From: Nicolas Pitre @ 2006-04-10  3:29 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060409224548.GB21455@erlang.gbar.dtu.dk>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1001 bytes --]

On Mon, 10 Apr 2006, Peter Eriksen wrote:

> On Sun, Apr 09, 2006 at 01:45:00PM -0400, Nicolas Pitre wrote:
> ...
> > Try this with the README file from the git source tree:
> > 
> > 	sed s/git/GIT/g < ./README > /tmp/README.mod
> > 	test-delta -d ./README /tmp/README.mod /tmp/README.delta
> > 	[BOOM!]
> 
> I found the bug.  The code still has some limitations, but now
> it passes the test suite.  Thanks for your help, Nicolas.

OK here's some more meat for you:

Copy the same README file from the git source tree, then edit the copied 
version so the "Blob Object" section and the "Tree Object" section are 
swapped around as shown in the attached patch.

The best delta that can be achieved is 24 bytes.

With the current code the produced delta is 42 bytes.

With your code the resulting delta is 4978 bytes, about twice as large 
as the attached patch.

One major limitation of your algorithm appears to be that it does not 
have a global view of the base buffer before starting to find matches.


Nicolas
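
To illustrate why a swap of two sections can be expressed so compactly:
with a view of the whole base buffer, the target is just four "copy from
base" instructions (prefix, second section, first section, suffix) and no
literal data. The sketch below is a toy model of that idea only; struct
copy_op and apply_copies() are hypothetical names and this is not the
encoding used by diff-delta.c or patch-delta.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One "copy from base" instruction: take len bytes starting at off. */
struct copy_op {
	size_t off;
	size_t len;
};

/*
 * Rebuild a target buffer from base using only copy instructions.
 * Returns the number of bytes written, or (size_t)-1 if an instruction
 * reaches outside the base buffer or the output buffer is too small.
 */
static size_t apply_copies(const char *base, size_t base_len,
			   const struct copy_op *ops, size_t nr,
			   char *out, size_t out_cap)
{
	size_t i, pos = 0;

	assert(base != NULL && ops != NULL && out != NULL);
	for (i = 0; i < nr; i++) {
		/* Reject copies that reach past the end of the base. */
		if (ops[i].off > base_len || ops[i].len > base_len - ops[i].off)
			return (size_t)-1;
		/* Reject copies that would overflow the output. */
		if (ops[i].len > out_cap - pos)
			return (size_t)-1;
		memcpy(out + pos, base + ops[i].off, ops[i].len);
		pos += ops[i].len;
	}
	return pos;
}
```

For a base of "headAAAABBBtail", the instructions {0,4}, {8,3}, {4,4},
{11,4} produce "headBBBAAAAtail": the two middle blocks trade places and
the cost is independent of how large the blocks are.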

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2372 bytes --]

--- f1	2006-04-09 13:31:26.000000000 -0400
+++ f2	2006-04-09 23:04:10.000000000 -0400
@@ -87,26 +87,6 @@
 
 The object types in some more detail:
 
-Blob Object
-~~~~~~~~~~~
-A "blob" object is nothing but a binary blob of data, and doesn't
-refer to anything else.  There is no signature or any other
-verification of the data, so while the object is consistent (it 'is'
-indexed by its sha1 hash, so the data itself is certainly correct), it
-has absolutely no other attributes.  No name associations, no
-permissions.  It is purely a blob of data (i.e. normally "file
-contents").
-
-In particular, since the blob is entirely defined by its data, if two
-files in a directory tree (or in multiple different versions of the
-repository) have the same contents, they will share the same blob
-object. The object is totally independent of its location in the
-directory tree, and renaming a file does not change the object that
-file is associated with in any way.
-
-A blob is typically created when gitlink:git-update-index[1]
-is run, and its data can be accessed by gitlink:git-cat-file[1].
-
 Tree Object
 ~~~~~~~~~~~
 The next hierarchical object type is the "tree" object.  A tree object
@@ -147,6 +127,26 @@
 its data can be accessed by gitlink:git-ls-tree[1].
 Two trees can be compared with gitlink:git-diff-tree[1].
 
+Blob Object
+~~~~~~~~~~~
+A "blob" object is nothing but a binary blob of data, and doesn't
+refer to anything else.  There is no signature or any other
+verification of the data, so while the object is consistent (it 'is'
+indexed by its sha1 hash, so the data itself is certainly correct), it
+has absolutely no other attributes.  No name associations, no
+permissions.  It is purely a blob of data (i.e. normally "file
+contents").
+
+In particular, since the blob is entirely defined by its data, if two
+files in a directory tree (or in multiple different versions of the
+repository) have the same contents, they will share the same blob
+object. The object is totally independent of its location in the
+directory tree, and renaming a file does not change the object that
+file is associated with in any way.
+
+A blob is typically created when gitlink:git-update-index[1]
+is run, and its data can be accessed by gitlink:git-cat-file[1].
+
 Commit Object
 ~~~~~~~~~~~~~
 The "commit" object is an object that introduces the notion of
