Git development
 help / color / mirror / Atom feed
* Re: [PATCH 03/10] push: support pushing HEAD to real branch name
From: Junio C Hamano @ 2007-10-30  8:28 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: git
In-Reply-To: <1193593581486-git-send-email-prohaska@zib.de>

Nice.

^ permalink raw reply

* Re: [PATCH] Fix a small memory leak in builtin-add
From: Junio C Hamano @ 2007-10-30  8:28 UTC (permalink / raw)
  To: Benoit Sigoure; +Cc: git, gitster
In-Reply-To: <1193641233-3880-2-git-send-email-tsuna@lrde.epita.fr>

Benoit Sigoure <tsuna@lrde.epita.fr> writes:

> prune_directory and fill_directory allocated one byte per pathspec and never
> freed it.
>
> Signed-off-by: Benoit Sigoure <tsuna@lrde.epita.fr>
> ---
> I don't know whether this was intentionnal or not (since it's
> only a matter of having small chunks of memory not being
> reclaimed and I don't think we can enter these functions
> twice, so these are not leaks in the sense that the program
> will consume more and more memory).

I think it was just a slop.  Thanks for catching it.

^ permalink raw reply

* Re: [PATCH 0/4] Build in some more things
From: Junio C Hamano @ 2007-10-30  8:14 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0710292049450.7357@iabervon.org>

I've merged this to 'pu', but honestly speaking, the conflicts
are geting a bit more "interesting" than I care to keep
repeating even with help from rerere, with four people
simultaneously touching the neighbouring code in their topics.

Topics involved are:

** db/remote-builtin (Mon Oct 29 22:03:42 2007 -0400) 4 commits
 . Use built-in send-pack.
 . Build-in send-pack, with an API for other programs to call.
 . Build-in peek-remote, using transport infrastructure.
 . Miscellaneous const changes and utilities

*  jk/send-pack (Thu Oct 18 02:17:46 2007 -0400) 2 commits
 + t5516: test update of local refs on push
 + send-pack: don't update tracking refs on error

*  js/forkexec (Fri Oct 19 21:48:06 2007 +0200)
 + Use start_command() in git_connect() instead of explicit
   fork/exec.
 + Change git_connect() to return a struct child_process instead of a
   pid_t.

** sp/push-refspec (Sun Oct 28 18:46:21 2007 +0100)
 . push: teach push to pass --verbose option to transport layer
 . push: use same rules as git-rev-parse to resolve refspecs
 . add ref_abbrev_matches_full_with_rev_parse_rules() comparing
   abbrev with full ref name
 . rename ref_matches_abbrev() to
   ref_abbrev_matches_full_with_fetch_rules()

Could you please check the result after I push it out?

^ permalink raw reply

* Possible bug: git-svn leaves broken tree in case of error
From: Anton Korobeynikov @ 2007-10-30  7:30 UTC (permalink / raw)
  To: git

Hello, Everyone.

I noticed this bug several times. Consider the following conditions are
met:
- We're syncing from svn using git-svn :)
- We have authors file provided
- We have a changeset with author unlisted in the authors file.

git-svn dies due to the following code:
sub check_author {
        my ($author) = @_;
        if (!defined $author || length $author == 0) {
                $author = '(no author)';
        }
        if (defined $::_authors && ! defined $::users{$author}) {
                die "Author: $author not defined in $::_authors file\n";
        }
        $author;
}

Unfortunately it leaves repository in some middle state: git-svn itself
thinks, that it synced with everything, but git itself doesn't "see" any
changesets anymore. I found no way to repair tree after such situation,
so I had to repull from scratch.

I found myself, that this should be warning (and fix in this case is
trivial), not error (maybe some commandline switch to control behaviour,
etc). It can be even error, but breaking tree is definitely bug in this
case.

Any thoughts?

-- 
With best regards, Anton Korobeynikov.

Faculty of Mathematics & Mechanics, Saint Petersburg State University.

^ permalink raw reply

* [PATCH v3] Speedup scanning for excluded files.
From: Lars Knoll @ 2007-10-30  7:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, git, Pierre Habouzit
In-Reply-To: <200710300828.49840.lars@trolltech.com>

Try to avoid a lot of work scanning for excluded files,
by caching some more information when setting up the exclusion
data structure.

Speeds up 'git runstatus' on a repository containing the Qt sources by 30% and
reduces the amount of instructions executed (as measured by valgrind) by a
factor of 2. A 'git runstatus' on the git repository goes from 100M instructions
down to about 22M.

Signed-off-by: Lars Knoll <lars@trolltech.com>
---

Included an out of bounds check for the issue Morten found.

 dir.c |   61 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 dir.h |    7 +++++++
 2 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/dir.c b/dir.c
index 4c17d36..b3d7462 100644
--- a/dir.c
+++ b/dir.c
@@ -118,14 +118,32 @@ int match_pathspec(const char **pathspec, const char *name, int namelen, int pre
 	return retval;
 }
 
+static int no_wildcard(const char *string)
+{
+	return string[strcspn(string, "*?[{")] == '\0';
+}
+
 void add_exclude(const char *string, const char *base,
 		 int baselen, struct exclude_list *which)
 {
 	struct exclude *x = xmalloc(sizeof (*x));
 
+	x->to_exclude = 1;
+	if (*string == '!') {
+		x->to_exclude = 0;
+		string++;
+	}
 	x->pattern = string;
+	x->patternlen = strlen(string);
 	x->base = base;
 	x->baselen = baselen;
+	x->flags = 0;
+	if (!strchr(string, '/'))
+		x->flags |= EXC_FLAG_NODIR;
+	if (no_wildcard(string))
+		x->flags |= EXC_FLAG_NOWILDCARD;
+	if (*string == '*' && no_wildcard(string+1))
+		x->flags |= EXC_FLAG_ENDSWITH;
 	if (which->nr == which->alloc) {
 		which->alloc = alloc_nr(which->alloc);
 		which->excludes = xrealloc(which->excludes,
@@ -209,7 +227,7 @@ void pop_exclude_per_directory(struct dir_struct *dir, int stk)
  * Return 1 for exclude, 0 for include and -1 for undecided.
  */
 static int excluded_1(const char *pathname,
-		      int pathlen,
+		      int pathlen, const char *basename, 
 		      struct exclude_list *el)
 {
 	int i;
@@ -218,19 +236,21 @@ static int excluded_1(const char *pathname,
 		for (i = el->nr - 1; 0 <= i; i--) {
 			struct exclude *x = el->excludes[i];
 			const char *exclude = x->pattern;
-			int to_exclude = 1;
+			int to_exclude = x->to_exclude;
 
-			if (*exclude == '!') {
-				to_exclude = 0;
-				exclude++;
-			}
-
-			if (!strchr(exclude, '/')) {
+			if (x->flags & EXC_FLAG_NODIR) {
 				/* match basename */
-				const char *basename = strrchr(pathname, '/');
-				basename = (basename) ? basename+1 : pathname;
-				if (fnmatch(exclude, basename, 0) == 0)
-					return to_exclude;
+				if (x->flags & EXC_FLAG_NOWILDCARD) {
+					if (!strcmp(exclude, basename))
+						return to_exclude;
+				} else if (x->flags & EXC_FLAG_ENDSWITH) {
+					if (x->patternlen - 1 <= pathlen &&
+                                            !strcmp(exclude + 1, pathname + pathlen - x->patternlen + 1))
+						return to_exclude;
+				} else {
+					if (fnmatch(exclude, basename, 0) == 0)
+						return to_exclude;
+				}
 			}
 			else {
 				/* match with FNM_PATHNAME:
@@ -246,9 +266,14 @@ static int excluded_1(const char *pathname,
 				    strncmp(pathname, x->base, baselen))
 				    continue;
 
-				if (fnmatch(exclude, pathname+baselen,
-					    FNM_PATHNAME) == 0)
-					return to_exclude;
+				if (x->flags & EXC_FLAG_NOWILDCARD) {
+					if (!strcmp(exclude, pathname + baselen))
+						return to_exclude;
+				} else {
+					if (fnmatch(exclude, pathname+baselen,
+						    FNM_PATHNAME) == 0)
+					    return to_exclude;
+				}
 			}
 		}
 	}
@@ -259,9 +284,11 @@ int excluded(struct dir_struct *dir, const char *pathname)
 {
 	int pathlen = strlen(pathname);
 	int st;
-
+	const char *basename = strrchr(pathname, '/');
+	basename = (basename) ? basename+1 : pathname;
+	
 	for (st = EXC_CMDL; st <= EXC_FILE; st++) {
-		switch (excluded_1(pathname, pathlen, &dir->exclude_list[st])) {
+		switch (excluded_1(pathname, pathlen, basename, &dir->exclude_list[st])) {
 		case 0:
 			return 0;
 		case 1:
diff --git a/dir.h b/dir.h
index a248a23..3ce8dbe 100644
--- a/dir.h
+++ b/dir.h
@@ -17,13 +17,20 @@ struct dir_entry {
 	char name[FLEX_ARRAY]; /* more */
 };
 
+#define EXC_FLAG_NODIR 1
+#define EXC_FLAG_NOWILDCARD 2
+#define EXC_FLAG_ENDSWITH 4
+
 struct exclude_list {
 	int nr;
 	int alloc;
 	struct exclude {
 		const char *pattern;
+		int patternlen;		   
 		const char *base;
 		int baselen;
+		int to_exclude;
+		int flags;
 	} **excludes;
 };
 
-- 
1.5.3.4.383.gd90a7

^ permalink raw reply related

* Re: [PATCH] Speedup scanning for excluded files.
From: Lars Knoll @ 2007-10-30  7:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, git, Pierre Habouzit
In-Reply-To: <7vzly1sc7p.fsf@gitster.siamese.dyndns.org>

On Tuesday 30 October 2007, Junio C Hamano wrote:
> "Morten Welinder" <mwelinder@gmail.com> writes:
> >> +                               } else if (x->flags & EXC_FLAG_ENDSWITH)
> >> { +                                       if (!strcmp(exclude + 1,
> >> pathname + pathlen -x->patternlen + 1))
> >
> > Is there some guarantee that the result of that subtraction is still
> > within the string?
>
> Good eyes.
>
> If pattern is "*.exe", patternlen is 5, and strcmp wants to
> compare 4 chars, so pathlen is better be at least that long, and
> we do allow that pattern to match a hidden file ".exe".
>
> Like this?
>
> 	if (x->patternlen - 1 <= pathlen &&
>         	!strcmp(exclude + 1, pathname + pathlen - x->patternlen + 1))
> 		return to_exclude;

Yes, that looks right. Thanks for catching that one.

Lars

^ permalink raw reply

* Re: [PATCH 0/4] Build in some more things
From: Johannes Sixt @ 2007-10-30  7:24 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0710292049450.7357@iabervon.org>

Daniel Barkalow schrieb:
> The main effect of this series is removing the fork/exec from pushing via 
> the git protocol (aside from the later fork/exec in connect.c of course).
> 
> It also heads off some tempting transport-related fetch bugs, which I will 
> not introduce in a later patch.
> 
> * Miscellaneous const changes and utilities
>   Adds two small utility functions, and marks a bunch of stuff as const; 
>   the const stuff is to keep builtin-fetch from getting messed up without 
>   a warning, because it wants some lists not to change.
> 
> * Build-in peek-remote, using transport infrastructure.
> * Build-in send-pack, with an API for other programs to call.
> * Use built-in send-pack.

I assume this goes on top of current master or db/fetch-pack. The patches 
have some conflicts with js/forkexec (nothing serious, though). Maybe it 
makes sense to rebase on top of that.

-- Hannes

^ permalink raw reply

* Re: [PATCH 1/2] add throughput to progress display
From: Johannes Sixt @ 2007-10-30  7:07 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <alpine.LFD.0.9999.0710291904190.22100@xanadu.home>

Nicolas Pitre schrieb:
> diff --git a/progress.h b/progress.h
> index 07b56bd..eba457f 100644
> --- a/progress.h
> +++ b/progress.h
> @@ -1,6 +1,21 @@
>  #ifndef PROGRESS_H
>  #define PROGRESS_H
>  
> +#include <sys/time.h>

System headers should be taken care of by git-compat-util.h. In fact, it 
already pulls in <sys/time.h>, hence, it shouldn't be necessary here.

-- Hannes

^ permalink raw reply

* Re: How to run git-gui always in English?
From: Peter Karlsson @ 2007-10-29 12:58 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Steffen Prohaska, Shawn O. Pearce, Git Mailing List
In-Reply-To: <20071026214124.GB3062@steel.home>

Alex Riesen:

> Because you do not send changes to a _server_. There is no server. There 
> is just another repo. Hence just "push"

Fine. "Send to repository", then. My point is that if "push" is a technical 
term, then it doesn't belong in the GUI, and if it isn't, then it should be 
translated like any other UI element.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply

* Re: remote#branch
From: Tom Prince @ 2007-10-30  5:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Junio C Hamano, Jan Hudec, Johannes Schindelin,
	Petr Baudis, Paolo Ciarrocchi, git
In-Reply-To: <alpine.LFD.0.999.0710292150400.30120@woody.linux-foundation.org>

On Mon, Oct 29, 2007 at 09:51:47PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 30 Oct 2007, Theodore Tso wrote:
> > 
> > Mirror URL  git://repo.or.cz/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
> >             http://repo.or.cz/r/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
> > Push URL    git+ssh://repo.or.cz/srv/git/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
> 
> No.
> 
> The push url is generally written as
> 
> 	repo.or.cz:/srv/git/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
> 
> Tough.

But gitweb (on git.kernel.org and repo.or.cz) both give git:// locators.

> 
> > Quick!  Which of the URL-like strings follow the URL quoting rules,
> > and which ones don't?
> 
> Quick! WHO THE F*CK CARES?
> 

So, how should git deal with

git://repo.or.cz/linux-2.6/linux acpi-2.6/ibm-acpi-2.6.git
git://repo.or.cz/linux-2.6/linux+acpi-2.6/ibm-acpi-2.6.git
git://repo.or.cz/linux-2.6/linux%20acpi-2.6/ibm-acpi-2.6.git

compared to 

http://repo.or.cz/linux-2.6/linux acpi-2.6/ibm-acpi-2.6.git
http://repo.or.cz/linux-2.6/linux+acpi-2.6/ibm-acpi-2.6.git
http://repo.or.cz/linux-2.6/linux%20acpi-2.6/ibm-acpi-2.6.git

Note: this doesn't have anything to do with server:/path/to/repo

Not that I care, but git should probably handle things consistently.

  Tom

^ permalink raw reply

* Re: [PATCH/RFC 0/3] faster inexact rename handling
From: Linus Torvalds @ 2007-10-30  5:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Andy C, Junio C Hamano
In-Reply-To: <20071030042118.GA14729@sigill.intra.peff.net>



On Tue, 30 Oct 2007, Jeff King wrote:
>
>   - no improvement on smaller datasets. Running "git-whatchanged -M
>     --raw -l0" on the linux-2.6 repo takes about the same time with the
>     old and new code (presumably the algorithmic savings of the new code
>     are lost in a higher constant factor, so when n is small, it is a
>     wash).

Have you compared the results? IOW, does it find the *same* renames?

I'm a bit worried about the fact that you just pick a single (arbitrary) 
src/dst per fingerprint. Yes, it should be limited, but that seems to be a 
bit too *extremely* limited. But if it gives the same results in practice, 
maybe nobody cares?

		Linus

^ permalink raw reply

* Re: remote#branch
From: Linus Torvalds @ 2007-10-30  4:51 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Junio C Hamano, Jan Hudec, Johannes Schindelin, Petr Baudis,
	Paolo Ciarrocchi, git
In-Reply-To: <20071030044026.GA9600@thunk.org>



On Tue, 30 Oct 2007, Theodore Tso wrote:
> 
> Mirror URL  git://repo.or.cz/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
>             http://repo.or.cz/r/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
> Push URL    git+ssh://repo.or.cz/srv/git/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git

No.

The push url is generally written as

	repo.or.cz:/srv/git/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git

Tough.

> Quick!  Which of the URL-like strings follow the URL quoting rules,
> and which ones don't?

Quick! WHO THE F*CK CARES?

		Linus

^ permalink raw reply

* Re: remote#branch
From: Linus Torvalds @ 2007-10-30  4:50 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Hudec, Johannes Schindelin, Petr Baudis, Paolo Ciarrocchi,
	git
In-Reply-To: <20071030030104.GK21133@thunk.org>



On Mon, 29 Oct 2007, Theodore Tso wrote:
> 
> So let's not call them URL's, since they're clearly not.

Hey, you go ahead.

Me, I'll refuse to make up a new name just because somebody thinks they 
"own" the concept of URL's.

Let's face it, nobody really cares.

		Linus

^ permalink raw reply

* Re: stg branch --create test v2.6.24-rc1 doesn't work
From: Aneesh Kumar @ 2007-10-30  4:50 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: kha, Git Mailing List
In-Reply-To: <b0943d9e0710291011p11bd8901udeb758fa653610bc@mail.gmail.com>

On 10/29/07, Catalin Marinas <catalin.marinas@gmail.com> wrote:
> On 26/10/2007, Aneesh Kumar <aneesh.kumar@gmail.com> wrote:
> > $ stg branch --create test v2.6.24-rc1
> > Checking for changes in the working directory ... done
> > Don't know how to determine parent branch from "v2.6.24-rc1"
> > Branch "test" created
> > [test@linux-review-ext]$ git log
> >
> >
> > Throws an error/warning in the above command.
> >
> > Also import then gives
> >
> > [test@linux-review-ext]$ stg import
> > /home/opensource/patches/ext4-patch-queue/ext4_mballoc_freespace_accounting_fix.patch
> > Checking for changes in the working directory ... done
> > fatal: cebdeed27b068dcc3e7c311d7ec0d9c33b5138c2 is not a valid 'commit' object
> > stg import: git-commit-tree failed with code 128
>
> What version of StGIT are you using?
>


Latest stgit.  d4356ac6143757131b58c95ca5c6a7b386cc6087


> You could try 'stg branch --create test v2.6.24-rc1^{commit} but I
> thought latest StGIT adds this by default.
>


-aneesh

^ permalink raw reply

* Re: remote#branch
From: Theodore Tso @ 2007-10-30  4:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Jan Hudec, Johannes Schindelin, Petr Baudis,
	Paolo Ciarrocchi, git
In-Reply-To: <7vtzo9s221.fsf@gitster.siamese.dyndns.org>

On Mon, Oct 29, 2007 at 08:40:06PM -0700, Junio C Hamano wrote:
> Theodore Tso <tytso@mit.edu> writes:
> 
> > It would probably also be a good idea to expurgate URL's from the
> > documentations, because, well, they aren't URL's.  Git doesn't treat
> > them like URL's, and you've said you aren't interested in changing git
> > to treat them like URL's, and finally git:// isn't a registered URL
> > scheme name with the IANA registration authority.  So let's not call
> > them URL's, since they're clearly not.
> 
> Are you playing reverse psychology, encouraging us to switch to
> RFC-conforming quoting?

Well, URL's have a specific meaning and they're not for web browsing.
They are used to specify the addresses of printers, e-mail addresses,
LDAP servers, and much more.  Git is using something that looks like
URL's, but they they don't follow the URL rules.  So this brings up
interesting questions when a git webpage displayes something like
this (taken from repo.or.cz):

Mirror URL  git://repo.or.cz/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
            http://repo.or.cz/r/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
Push URL    git+ssh://repo.or.cz/srv/git/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git

Quick!  Which of the URL-like strings follow the URL quoting rules,
and which ones don't?  And if you have a character that must be quoted
in the pthname, should be quoted in in the http:// line?  And does it
matter if you are passing that URL-like string to a web browser or to
git-clone?

This is not language lawyering; this is a consistency issue that
matters.  But if Linus is going to say that git isn't going to follow
the URL quoting rules because it isn't a web browser, then clearly
what we are using aren't URL's.  So let's not *call* them URL's in our
documentation, because we're not following the URL rules.  That way is
only going to spawn more confusion.

Is this reverse psychology?  Well, I think git needs to pick whether
we operate on URL's or just things that *look* like URL's.  Either
way, the documentation should be specific about what's going on.  Me,
I'd prefer that git be liberal in what it accepts, and conservative in
what it sends.  That means that it would be useful if git could accept
URL's that are correctly quoted, and let things slide if characters
that should be quoted, aren't.  Git rarely actually emits URL's, and
mercifully people rarely use things like '#' characters in their
pathnames, so I don't think it would be that hard to make the
transition.

     	 	       	      	 	  - Ted

^ permalink raw reply

* [PATCH] Bisect: add "skip" to the short usage string.
From: Christian Couder @ 2007-10-30  4:40 UTC (permalink / raw)
  To: Junio Hamano, David Symonds; +Cc: git

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 git-bisect.sh |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

	This is an alternative patch to the previous one
	with the same name.

	Here I put "skip" just after "bad" and "good" as
	suggested by David Symonds, both in the short and
	long usage string.

diff --git a/git-bisect.sh b/git-bisect.sh
index 180c6c2..b74f44d 100755
--- a/git-bisect.sh
+++ b/git-bisect.sh
@@ -1,12 +1,14 @@
 #!/bin/sh
 
-USAGE='[start|bad|good|next|reset|visualize|replay|log|run]'
+USAGE='[start|bad|good|skip|next|reset|visualize|replay|log|run]'
 LONG_USAGE='git bisect start [<bad> [<good>...]] [--] [<pathspec>...]
         reset bisect state and start bisection.
 git bisect bad [<rev>]
         mark <rev> a known-bad revision.
 git bisect good [<rev>...]
         mark <rev>... known-good revisions.
+git bisect skip [<rev>...]
+        mark <rev>... untestable revisions.
 git bisect next
         find next bisection to test and check it out.
 git bisect reset [<branch>]
@@ -17,8 +19,6 @@ git bisect replay <logfile>
         replay bisection log.
 git bisect log
         show bisect log.
-git bisect skip [<rev>...]
-        mark <rev>... untestable revisions.
 git bisect run <cmd>...
         use <cmd>... to automatically bisect.'
 
-- 
1.5.3.4.1405.g6c7bae

^ permalink raw reply related

* [PATCH/RFC 3/3] handle renames using similarity engine
From: Jeff King @ 2007-10-30  4:24 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano
In-Reply-To: <20071030042118.GA14729@sigill.intra.peff.net>

This changes diffcore-rename to use the engine in
similarity.c rather than doing an O(m*n) loop around
diffcore_count_changes.

Signed-off-by: Jeff King <peff@peff.net>
---
 diffcore-rename.c |  215 +++++++++++++++++++----------------------------------
 1 files changed, 76 insertions(+), 139 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index ba038af..6c0d2d0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -5,12 +5,21 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "hash.h"
+#include "similarity.h"
 
-/* Table of rename/copy destinations */
+/* Table of rename/copy src files */
+static struct diff_rename_src {
+	struct diff_filespec *one;
+	unsigned short score; /* to remember the break score */
+} *rename_src;
+static int rename_src_nr, rename_src_alloc;
 
+/* Table of rename/copy destinations */
 static struct diff_rename_dst {
 	struct diff_filespec *two;
 	struct diff_filepair *pair;
+	struct diff_rename_src *best_match;
+	unsigned score;
 } *rename_dst;
 static int rename_dst_nr, rename_dst_alloc;
 
@@ -49,16 +58,11 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 	rename_dst[first].two = alloc_filespec(two->path);
 	fill_filespec(rename_dst[first].two, two->sha1, two->mode);
 	rename_dst[first].pair = NULL;
+	rename_dst[first].best_match = NULL;
+	rename_dst[first].score = 0;
 	return &(rename_dst[first]);
 }
 
-/* Table of rename/copy src files */
-static struct diff_rename_src {
-	struct diff_filespec *one;
-	unsigned short score; /* to remember the break score */
-} *rename_src;
-static int rename_src_nr, rename_src_alloc;
-
 static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 						   unsigned short score)
 {
@@ -109,88 +113,6 @@ static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 		(!dst_len || dst->path[dst_len - 1] == '/');
 }
 
-struct diff_score {
-	int src; /* index in rename_src */
-	int dst; /* index in rename_dst */
-	int score;
-	int name_score;
-};
-
-static int estimate_similarity(struct diff_filespec *src,
-			       struct diff_filespec *dst,
-			       int minimum_score)
-{
-	/* src points at a file that existed in the original tree (or
-	 * optionally a file in the destination tree) and dst points
-	 * at a newly created file.  They may be quite similar, in which
-	 * case we want to say src is renamed to dst or src is copied into
-	 * dst, and then some edit has been applied to dst.
-	 *
-	 * Compare them and return how similar they are, representing
-	 * the score as an integer between 0 and MAX_SCORE.
-	 *
-	 * When there is an exact match, it is considered a better
-	 * match than anything else; the destination does not even
-	 * call into this function in that case.
-	 */
-	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
-	unsigned long delta_limit;
-	int score;
-
-	/* We deal only with regular files.  Symlink renames are handled
-	 * only when they are exact matches --- in other words, no edits
-	 * after renaming.
-	 */
-	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
-		return 0;
-
-	/*
-	 * Need to check that source and destination sizes are
-	 * filled in before comparing them.
-	 *
-	 * If we already have "cnt_data" filled in, we know it's
-	 * all good (avoid checking the size for zero, as that
-	 * is a possible size - we really should have a flag to
-	 * say whether the size is valid or not!)
-	 */
-	if (!src->cnt_data && diff_populate_filespec(src, 0))
-		return 0;
-	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
-		return 0;
-
-	max_size = ((src->size > dst->size) ? src->size : dst->size);
-	base_size = ((src->size < dst->size) ? src->size : dst->size);
-	delta_size = max_size - base_size;
-
-	/* We would not consider edits that change the file size so
-	 * drastically.  delta_size must be smaller than
-	 * (MAX_SCORE-minimum_score)/MAX_SCORE * min(src->size, dst->size).
-	 *
-	 * Note that base_size == 0 case is handled here already
-	 * and the final score computation below would not have a
-	 * divide-by-zero issue.
-	 */
-	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
-		return 0;
-
-	delta_limit = (unsigned long)
-		(base_size * (MAX_SCORE-minimum_score) / MAX_SCORE);
-	if (diffcore_count_changes(src, dst,
-				   &src->cnt_data, &dst->cnt_data,
-				   delta_limit,
-				   &src_copied, &literal_added))
-		return 0;
-
-	/* How similar are they?
-	 * what percentage of material in dst are from source?
-	 */
-	if (!dst->size)
-		score = 0; /* should not happen */
-	else
-		score = (int)(src_copied * MAX_SCORE / max_size);
-	return score;
-}
-
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
 	struct diff_filespec *src, *dst;
@@ -215,20 +137,6 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 	rename_dst[dst_index].pair = dp;
 }
 
-/*
- * We sort the rename similarity matrix with the score, in descending
- * order (the most similar first).
- */
-static int score_compare(const void *a_, const void *b_)
-{
-	const struct diff_score *a = a_, *b = b_;
-
-	if (a->score == b->score)
-		return b->name_score - a->name_score;
-
-	return b->score - a->score;
-}
-
 struct file_similarity {
 	int src_dst, index;
 	struct diff_filespec *filespec;
@@ -376,6 +284,67 @@ static int find_exact_renames(void)
 	return i;
 }
 
+static void record_similarity(void *vsrc, void *vdst, unsigned score)
+{
+	struct diff_rename_src *src = vsrc;
+	struct diff_rename_dst *dst = vdst;
+	unsigned max_size = (src->one->size > dst->two->size) ?
+				src->one->size : dst->two->size;
+
+	score = (dst->two->size != 0) ? (score * MAX_SCORE / max_size) : 0;
+
+	/* Is there a match already that is better than we are? */
+	if (dst->best_match) {
+		if (score < dst->score)
+			return;
+		if (score == dst->score && !basename_same(src->one, dst->two))
+			return;
+	}
+
+	dst->best_match = src;
+	dst->score = score;
+}
+
+static void find_approximate_renames(int minimum_score)
+{
+	struct similarity sim;
+	int i;
+
+	similarity_init(&sim);
+
+	for (i = 0; i < rename_src_nr; i++) {
+		struct diff_rename_src *s = &rename_src[i];
+		diff_populate_filespec(s->one, 0);
+		similarity_add(&sim, SIMILARITY_SOURCE, s,
+				s->one->data, s->one->size,
+				diff_filespec_is_binary(s->one));
+		diff_free_filespec_data(s->one);
+	}
+
+	for (i = 0; i < rename_dst_nr; i++) {
+		struct diff_rename_dst *d = &rename_dst[i];
+		if (d->pair)
+			continue;
+		diff_populate_filespec(d->two, 0);
+		similarity_add(&sim, SIMILARITY_DEST, d,
+				d->two->data, d->two->size,
+				diff_filespec_is_binary(d->two));
+		diff_free_filespec_data(d->two);
+	}
+
+	similarity_score(&sim);
+	similarity_report(&sim, record_similarity);
+
+	for (i = 0 ; i < rename_dst_nr; i++) {
+		struct diff_rename_dst *d = &rename_dst[i];
+		if (d->pair)
+			continue;
+		if (d->score < minimum_score)
+			continue;
+		record_rename_pair(i, d->best_match - rename_src, d->score);
+	}
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -383,9 +352,8 @@ void diffcore_rename(struct diff_options *options)
 	int rename_limit = options->rename_limit;
 	struct diff_queue_struct *q = &diff_queued_diff;
 	struct diff_queue_struct outq;
-	struct diff_score *mx;
-	int i, j, rename_count;
-	int num_create, num_src, dst_cnt;
+	int num_create, num_src;
+	int i, rename_count;
 
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -462,38 +430,7 @@ void diffcore_rename(struct diff_options *options)
 	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	mx = xmalloc(sizeof(*mx) * num_create * num_src);
-	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		int base = dst_cnt * num_src;
-		struct diff_filespec *two = rename_dst[i].two;
-		if (rename_dst[i].pair)
-			continue; /* dealt with exact match already. */
-		for (j = 0; j < rename_src_nr; j++) {
-			struct diff_filespec *one = rename_src[j].one;
-			struct diff_score *m = &mx[base+j];
-			m->src = j;
-			m->dst = i;
-			m->score = estimate_similarity(one, two,
-						       minimum_score);
-			m->name_score = basename_same(one, two);
-			diff_free_filespec_blob(one);
-		}
-		/* We do not need the text anymore */
-		diff_free_filespec_blob(two);
-		dst_cnt++;
-	}
-	/* cost matrix sorted by most to least similar pair */
-	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
-		if (dst->pair)
-			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
-		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
-		rename_count++;
-	}
-	free(mx);
+	find_approximate_renames(minimum_score);
 
  cleanup:
 	/* At this point, we have found some renames and copies and they
-- 
1.5.3.4

^ permalink raw reply related

* [PATCH/RFC 2/3] introduce generic similarity library
From: Jeff King @ 2007-10-30  4:24 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano
In-Reply-To: <20071030042118.GA14729@sigill.intra.peff.net>

This library attempts to find similarities among items
efficiently. It treats items as opaque pointers; callers are
responsible for providing an item along with its contents.

The algorithm is roughly:

  1. for each item, create a set of fingerprints for each
     chunk of the item (where each chunk is either delimited
     by a newline or is 64 characters, whichever is smaller
     -- this is the same fingerprint code from
     diffcore-delta.c). A hash stores a mapping of
     fingerprints to items, with each fingerprint having at
     most one 'source' item and one 'dest' item.

  2. for each fingerprint with a source and dest item,
     find the entry with key (source, dest) in a hash table
     and increment its value by the value of the fingerprint

  3. for each (source, dest) pair that had non-zero
     similarity, report the pair to the caller

The program test-similarity is a simple demonstration of the
code. It takes two list of files on stdin, with each file
separated by newlines and the two lists separated by a blank
line. It prints the similarity score of each non-zero pair
on stdout.

There are a few "interesting" design decisions, which should
probably be tweaked:

  - we store only one source and dest for each fingerprint
    item. We need to bound this list so that we don't get
    O(n^2) behavior for common fingerprints. This means that
    some files won't get "credit" for common fingerprints,
    which is probably OK, since those fingerprints are
    probably uninteresting. We could bound at some small
    number greater than one; one was chosen for speed and
    simplicity of implementation. We also don't store any
    overflow bit, so commonly used fingerprints will get
    assigned to whatever file is found with them last.

  - the similarity engine hands back absolute,
    non-normalized scores. That means that bigger items will
    have bigger scores, and the caller is responsible for
    normalizing. This also means that the engine cannot
    adjust normalization to account for chunks which are
    thrown out by the bounding mentioned above (i.e., if a
    file is 80 bytes, 64 of which are ignored as "common",
    then it can at most have 20% similarity (16/80) with
    another file. This means our normalized rename values
    (i.e., percentage similarity) will be lower than other
    algorithms.

Signed-off-by: Jeff King <peff@peff.net>
---
 .gitignore        |    1 +
 Makefile          |    4 +-
 similarity.c      |  152 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 similarity.h      |   24 ++++++++
 test-similarity.c |   54 +++++++++++++++++++
 5 files changed, 233 insertions(+), 2 deletions(-)
 create mode 100644 similarity.c
 create mode 100644 similarity.h
 create mode 100644 test-similarity.c

diff --git a/.gitignore b/.gitignore
index 62afef2..8ce6026 100644
--- a/.gitignore
+++ b/.gitignore
@@ -155,6 +155,7 @@ test-dump-cache-tree
 test-genrandom
 test-match-trees
 test-sha1
+test-similarity
 common-cmds.h
 *.tar.gz
 *.dsc
diff --git a/Makefile b/Makefile
index 2e6fd8f..0568d0d 100644
--- a/Makefile
+++ b/Makefile
@@ -300,7 +300,7 @@ DIFF_OBJS = \
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
 	date.o diff-delta.o entry.o exec_cmd.o ident.o \
-	interpolate.o hash.o \
+	interpolate.o hash.o similarity.o \
 	lockfile.o \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
@@ -975,7 +975,7 @@ endif
 
 ### Testing rules
 
-TEST_PROGRAMS = test-chmtime$X test-genrandom$X test-date$X test-delta$X test-sha1$X test-match-trees$X test-absolute-path$X
+TEST_PROGRAMS = test-chmtime$X test-genrandom$X test-date$X test-delta$X test-sha1$X test-match-trees$X test-absolute-path$X test-similarity$X
 
 all:: $(TEST_PROGRAMS)
 
diff --git a/similarity.c b/similarity.c
new file mode 100644
index 0000000..0bd135c
--- /dev/null
+++ b/similarity.c
@@ -0,0 +1,152 @@
+#include "cache.h"
+#include "similarity.h"
+
+struct fingerprint_entry {
+	void *src;
+	void *dst;
+	unsigned weight;
+};
+
+struct score_entry {
+	void *src;
+	void *dst;
+	unsigned score;
+	struct score_entry *next;
+};
+
+void similarity_init(struct similarity *s)
+{
+	init_hash(&s->fingerprints);
+	init_hash(&s->scores);
+}
+
+static int free_one_fingerprint(void *ve, void *data)
+{
+	struct fingerprint_entry *e = ve;
+	free(e);
+	return 0;
+}
+
+static int free_one_score(void *ve, void *data)
+{
+	struct score_entry *e = ve;
+	while (e) {
+		struct score_entry *next = e->next;
+		free(e);
+		e = next;
+	}
+	return 0;
+}
+
+void similarity_free(struct similarity *s)
+{
+	for_each_hash(&s->fingerprints, free_one_fingerprint, NULL);
+	free_hash(&s->fingerprints);
+	for_each_hash(&s->scores, free_one_score, NULL);
+	free_hash(&s->scores);
+}
+
+static void add_fingerprint(struct hash_table *h, unsigned int fp,
+		int type, void *data, unsigned weight)
+{
+	void **pos;
+	struct fingerprint_entry *e;
+
+	pos = insert_hash(fp, h);
+	if (!*pos) {
+		e = xmalloc(sizeof(*e));
+		e->weight = weight;
+		e->src = e->dst = NULL;
+		*pos = e;
+	}
+	else
+		e = *pos;
+
+	if (type == SIMILARITY_SOURCE)
+		e->src = data;
+	else
+		e->dst = data;
+}
+
+void similarity_add(struct similarity *sim, int type, void *data,
+		const char *buf, unsigned long sz, int is_text)
+{
+	int n;
+	unsigned int accum1, accum2, hashval;
+
+	n = 0;
+	accum1 = accum2 = 0;
+	while (sz) {
+		unsigned int c = *buf++;
+		unsigned int old_1 = accum1;
+		sz--;
+
+		/* Ignore CR in CRLF sequence if text */
+		if (!is_text && c == '\r' && sz && *buf == '\n')
+			continue;
+
+		accum1 = (accum1 << 7) ^ (accum2 >> 25);
+		accum2 = (accum2 << 7) ^ (old_1 >> 25);
+		accum1 += c;
+		if (++n < 64 && c != '\n')
+			continue;
+		hashval = accum1 + accum2 * 0x61;
+		add_fingerprint(&sim->fingerprints, hashval, type, data, n);
+		n = 0;
+		accum1 = accum2 = 0;
+	}
+}
+
+static unsigned hash_void_pair(void *a, void *b)
+{
+	return (unsigned)a + (unsigned)b;
+}
+
+static int score_one_entry(void *vfp, void *vsim)
+{
+	struct fingerprint_entry *fp = vfp;
+	struct similarity *sim = vsim;
+	struct score_entry *score;
+	void **pos;
+
+	if (!fp->src || !fp->dst)
+		return 0;
+
+	pos = insert_hash(hash_void_pair(fp->src, fp->dst), &sim->scores);
+	for (score = *pos; score; score = score->next) {
+		if (score->src == fp->src && score->dst == fp->dst) {
+			score->score += fp->weight;
+			return 0;
+		}
+	}
+
+	score = xmalloc(sizeof(*score));
+	score->src = fp->src;
+	score->dst = fp->dst;
+	score->score = fp->weight;
+	score->next = *pos;
+	*pos = score;
+
+	return 0;
+}
+
+void similarity_score(struct similarity *s)
+{
+	for_each_hash(&s->fingerprints, score_one_entry, s);
+}
+
+static int report_one_score(void *ve, void *vdata)
+{
+	struct score_entry *e;
+	void (*fn)(void *, void *, unsigned) = vdata;
+
+	for (e = ve; e; e = e->next)
+		fn(e->src, e->dst, e->score);
+	return 1;
+}
+
+void similarity_report(struct similarity *s,
+		void (*fn)(void *, void *, unsigned))
+{
+	for_each_hash(&s->scores, report_one_score, fn);
+}
diff --git a/similarity.h b/similarity.h
new file mode 100644
index 0000000..6409ab8
--- /dev/null
+++ b/similarity.h
@@ -0,0 +1,24 @@
+#ifndef SIMILARITY_H
+#define SIMILARITY_H
+
+#include "hash.h"
+
+#define SIMILARITY_SOURCE 0
+#define SIMILARITY_DEST   1
+
+struct similarity {
+	struct hash_table fingerprints;
+	struct hash_table scores;
+};
+
+void similarity_init(struct similarity *s);
+void similarity_free(struct similarity *s);
+
+void similarity_add(struct similarity *s, int type, void *data,
+		const char *buf, unsigned long sz, int is_text);
+
+void similarity_score(struct similarity *s);
+void similarity_report(struct similarity *s,
+		void (*fn)(void *, void *, unsigned));
+
+#endif /* SIMILARITY_H */
diff --git a/test-similarity.c b/test-similarity.c
new file mode 100644
index 0000000..5947420
--- /dev/null
+++ b/test-similarity.c
@@ -0,0 +1,54 @@
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+
+#include "git-compat-util.h"
+#include "similarity.h"
+#include "strbuf.h"
+#include "xdiff-interface.h"
+
+static void print_rename(void *vone, void *vtwo, unsigned score) {
+	const char *one = vone, *two = vtwo;
+	printf("%u %s -> %s\n", score, one, two);
+}
+
+static void add_file(struct similarity *sim, int type, const char *fn)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int len;
+
+	len = strbuf_read_file(&sb, fn, 0);
+	if (len < 0)
+		die("unable to read %s: %s\n", fn, strerror(errno));
+
+	similarity_add(sim, type, strdup(fn),
+			sb.buf, sb.len, buffer_is_binary(sb.buf, sb.len));
+
+	strbuf_release(&sb);
+}
+
+int main(int argc, char **argv)
+{
+	struct similarity sim;
+	struct strbuf line;
+
+	similarity_init(&sim);
+	strbuf_init(&line, 0);
+
+	while (strbuf_getline(&line, stdin, '\n') != EOF) {
+		if (!line.len)
+			break;
+		add_file(&sim, SIMILARITY_SOURCE, line.buf);
+	}
+	while (strbuf_getline(&line, stdin, '\n') != EOF)
+		add_file(&sim, SIMILARITY_DEST, line.buf);
+
+	strbuf_release(&line);
+
+	similarity_score(&sim);
+	similarity_report(&sim, print_rename);
+
+	similarity_free(&sim);
+
+	return 0;
+}
-- 
1.5.3.4

^ permalink raw reply related

* [PATCH/RFC 1/3] change hash table calling conventions
From: Jeff King @ 2007-10-30  4:23 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano
In-Reply-To: <20071030042118.GA14729@sigill.intra.peff.net>

The recent hash table code has two limitations in its
calling conventions that are addressed here:

  1. Insertion either inserts a value, returning NULL, or
     returns a pointer to the previously inserted value.
     This is fine if you are making a linked list of the
     colliding values, but is awkward if your goal is to:
       a. modify the value if it already exists
       b. otherwise, allocate and insert the value
     With the old convention, you must either allocate the
     structure (and throw it away in case a), or perform two
     lookups (one to see if the entry exists, then another
     to perform the insertion).

     Instead, insertion no longer inserts any value; it
     simply returns a pointer to where you _can_ insert a
     value (which will be non-NULL if a value already
     existed).

  2. for_each_hash now allows a void 'data' pointer to be
     passed to the callback function along with each hash
     entry.

Signed-off-by: Jeff King <peff@peff.net>
---

The insertion feels kind of hack-ish. Suggestions are welcome.

 diffcore-rename.c |   14 +++++---------
 hash.c            |   18 +++++++++---------
 hash.h            |    5 +++--
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f9ebea5..ba038af 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -288,7 +288,7 @@ static void free_similarity_list(struct file_similarity *p)
 	}
 }
 
-static int find_same_files(void *ptr)
+static int find_same_files(void *ptr, void *data)
 {
 	int ret;
 	struct file_similarity *p = ptr;
@@ -343,13 +343,9 @@ static void insert_file_table(struct hash_table *table, int src_dst, int index,
 	entry->next = NULL;
 
 	hash = hash_filespec(filespec);
-	pos = insert_hash(hash, entry, table);
-
-	/* We already had an entry there? */
-	if (pos) {
-		entry->next = *pos;
-		*pos = entry;
-	}
+	pos = insert_hash(hash, table);
+	entry->next = *pos;
+	*pos = entry;
 }
 
 /*
@@ -372,7 +368,7 @@ static int find_exact_renames(void)
 		insert_file_table(&file_table, 1, i, rename_dst[i].two);
 
 	/* Find the renames */
-	i = for_each_hash(&file_table, find_same_files);
+	i = for_each_hash(&file_table, find_same_files, NULL);
 
 	/* .. and free the hash data structure */
 	free_hash(&file_table);
diff --git a/hash.c b/hash.c
index 7b492d4..dc5d20e 100644
--- a/hash.c
+++ b/hash.c
@@ -33,15 +33,13 @@ static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash
  * pointers or do anything else). If it didn't exist, return
  * NULL (and the caller knows the pointer has been inserted).
  */
-static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
+static void **insert_hash_entry(unsigned int hash, struct hash_table *table)
 {
 	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
 
 	if (!entry->ptr) {
-		entry->ptr = ptr;
 		entry->hash = hash;
 		table->nr++;
-		return NULL;
 	}
 	return &entry->ptr;
 }
@@ -60,8 +58,10 @@ static void grow_hash_table(struct hash_table *table)
 	for (i = 0; i < old_size; i++) {
 		unsigned int hash = old_array[i].hash;
 		void *ptr = old_array[i].ptr;
-		if (ptr)
-			insert_hash_entry(hash, ptr, table);
+		if (ptr) {
+			void **pos = insert_hash_entry(hash, table);
+			*pos = ptr;
+		}
 	}
 	free(old_array);
 }
@@ -73,15 +73,15 @@ void *lookup_hash(unsigned int hash, struct hash_table *table)
 	return &lookup_hash_entry(hash, table)->ptr;
 }
 
-void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
+void **insert_hash(unsigned int hash, struct hash_table *table)
 {
 	unsigned int nr = table->nr;
 	if (nr >= table->size/2)
 		grow_hash_table(table);
-	return insert_hash_entry(hash, ptr, table);
+	return insert_hash_entry(hash, table);
 }
 
-int for_each_hash(struct hash_table *table, int (*fn)(void *))
+int for_each_hash(struct hash_table *table, int (*fn)(void *, void *), void *d)
 {
 	int sum = 0;
 	unsigned int i;
@@ -92,7 +92,7 @@ int for_each_hash(struct hash_table *table, int (*fn)(void *))
 		void *ptr = array->ptr;
 		array++;
 		if (ptr) {
-			int val = fn(ptr);
+			int val = fn(ptr, d);
 			if (val < 0)
 				return val;
 			sum += val;
diff --git a/hash.h b/hash.h
index a8b0fbb..f9a50da 100644
--- a/hash.h
+++ b/hash.h
@@ -29,8 +29,9 @@ struct hash_table {
 };
 
 extern void *lookup_hash(unsigned int hash, struct hash_table *table);
-extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
-extern int for_each_hash(struct hash_table *table, int (*fn)(void *));
+extern void **insert_hash(unsigned int hash, struct hash_table *table);
+extern int for_each_hash(struct hash_table *table, int (*fn)(void *, void*),
+		void *data);
 extern void free_hash(struct hash_table *table);
 
 static inline void init_hash(struct hash_table *table)
-- 
1.5.3.4

^ permalink raw reply related

* [PATCH/RFC 0/3] faster inexact rename handling
From: Jeff King @ 2007-10-30  4:21 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano

This is my first stab at faster rename handling based on Andy's code.
The patches are on top of next (to get Linus' recent work on exact
renames). Most of the interesting stuff is in 2/3.

  1/3: extension of hash interface
  2/3: similarity detection code
  3/3: integrate similarity detection into diffcore-rename

The implementation is pretty basic, so I think there is room for
code optimization (50% of the time is spent in hash lookups, so we might
be able to micro-optimize that) as well as algorithmic improvements (like the
sampling Andy mentioned).

With these patches, I can get my monster binary diff down from about 2
minutes to 17 seconds. And comparing all of linux-2.4 to all of
linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

There are a few downsides:
  - the current implementation tends to give lower similarity values
    compared to the old code (see discussion in 2/3), but this should be
    tweakable
  - on large datasets, it's more memory hungry than the old code because
    the hash grows very large. This can be helped by bumping up the
    binary chunk size (actually, the 17 seconds quoted above is using
    256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
    more like 24 seconds) as well as sampling.
  - no improvement on smaller datasets. Running "git-whatchanged -M
    --raw -l0" on the linux-2.6 repo takes about the same time with the
    old and new code (presumably the algorithmic savings of the new code
    are lost in a higher constant factor, so when n is small, it is a
    wash).

-Peff

^ permalink raw reply

* Re: How to merge git://repo.or.cz/git-gui into git.git
From: Junio C Hamano @ 2007-10-30  3:46 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Yin Ping, Peter Baumann, git
In-Reply-To: <20071030005217.GT14735@spearce.org>

"Shawn O. Pearce" <spearce@spearce.org> writes:

>> 2. pull with subtree strategy
>> ~/git-gui$  git-pull -s subtree git master:master
>
> You are merging in the wrong direction.  You want to merge git-gui
> into git.git:
>
> 	git clone git://repo.or.cz/alt-git.git mygit
> 	cd mygit
> 	git pull -s subtree git://repo.or.cz/git-gui.git master
>
> So you are pulling git-gui into git, not the reverse.

Actually, subtree strategy was designed to allow merging back
and forth.  But the result, as it _is_ a merge, will not omit
any commit from the history from both branches.

^ permalink raw reply

* Re: remote#branch
From: Junio C Hamano @ 2007-10-30  3:40 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Jan Hudec, Johannes Schindelin, Petr Baudis,
	Paolo Ciarrocchi, git
In-Reply-To: <20071030030104.GK21133@thunk.org>

Theodore Tso <tytso@mit.edu> writes:

> It would probably also be a good idea to expurgate URL's from the
> documentations, because, well, they aren't URL's.  Git doesn't treat
> them like URL's, and you've said you aren't interested in changing git
> to treat them like URL's, and finally git:// isn't a registered URL
> scheme name with the IANA registration authority.  So let's not call
> them URL's, since they're clearly not.

Are you playing reverse psychology, encouraging us to switch to
RFC-conforming quoting?

^ permalink raw reply

* Re: remote#branch
From: Theodore Tso @ 2007-10-30  3:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Hudec, Johannes Schindelin, Petr Baudis, Paolo Ciarrocchi,
	git
In-Reply-To: <alpine.LFD.0.999.0710291545250.30120@woody.linux-foundation.org>

On Mon, Oct 29, 2007 at 03:57:41PM -0700, Linus Torvalds wrote:
> Sure, but "URL" in human-speak has nothing to do with an RFC.
> 
> I dislike language-lawyerese. Why the hell do people think that human 
> language should follow the RFC's?
> 
> Git addresses look like URL's, and they act like URL's, but dammit, git 
> isn't a web browser, and it's not interested in acting like one. 

The quoting rules aren't specific to a web browser; the whole point of
URL's is that they are uniform so that programs know how to handle
them without needing information specific to the URL type.  Hence the
quoting rules apply to all applications using URL's, whether it's CUPS
using a url such as: ipp://example.com/printer/tiger/bob or LDAP using
a url such as: ldap://ldap.example.com/dc=example,dc=com?postalAddress.

It's just git which is different here.  Having a uniform set of
processing rules are *useful* for applications and libraries that are
parsing URL's, not just for language-lawyer wanking.  Not that git
addresses that are layered on top of http is all that well supported
any more, but in that case we really are using an http-style URL ---
but yet git doesn't do URL quoting, because, well, it doesn't.  Yet in
that case it's very clear the http address is really a URL, and it's
arguably a defect that git doesn't handle an http address the way all
other applications handle http URL's.

At the very least, if we aren't going to change git, we should hang a
big fat sign in the documentation saying that although git location
names that begin git:// look like URL's, and smell like URL's, they
aren't treated the same way that all other applications treat URL's,
and the user shouldn't be surprised by this.  Furthermore, choosing
pathnames so that git:// and gitweb http:// addresses don't require
URL-style quoting, will probably save the user a fair amount of pain
and confusion because git refuses to treat git addresses as URL's.

It would probably also be a good idea to expurgate URL's from the
documentations, because, well, they aren't URL's.  Git doesn't treat
them like URL's, and you've said you aren't interested in changing git
to treat them like URL's, and finally git:// isn't a registered URL
scheme name with the IANA registration authority.  So let's not call
them URL's, since they're clearly not.

    	      	      	  	     - Ted

^ permalink raw reply

* Re: [PATCH 6/7] include $PATH in generating list of commands for "help -a"
From: Scott Parish @ 2007-10-30  3:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vir4ptyc9.fsf@gitster.siamese.dyndns.org>

On Mon, Oct 29, 2007 at 02:17:26PM -0700, Junio C Hamano wrote:

> It's easier to read if you briefly describe the differences
> between the replacement patch and the previous version of the
> patch below the three-dash lines.  See for example Lars Knoll's
> patch from today <200710290959.32538.lars@trolltech.com>.

Oh, that's useful information!

For the above patch, were basically what Johannes Schindelin suggested:

 + add_cmdname() now uses ALLOC_GROW and has its lines reordered to be
   somewhat cleanre
 + uniq() has lost the curly brackets
 + s/subtract_cmds/exclude_cmds/
 + exclude_cmds() uses an arg name of "path" instead of "dir"
 + exclude_cmd() no longer renames "dir" to "dirp"
 + exclude_cmds() an earlier patch moved the "struct stat" declaration
   for no-longer relevant reasons. change removed.

sRp

-- 
Scott Parish
http://srparish.net/

^ permalink raw reply

* Re: [PATCH] Correct handling of upload-pack in builtin-fetch-pack
From: Shawn O. Pearce @ 2007-10-30  2:41 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0710292232330.7357@iabervon.org>

Daniel Barkalow <barkalow@iabervon.org> wrote:
> The field in the args was being ignored in favor of a static constant
> 
> Signed-off-by: Daniel Barkalow <barkalow@iabervon.org>
> ---
> Found this while trying to figure out how builtin-fetch-pack was 
> initializing the string in the args struct, and why it generally worked 
> even though it wasn't.

Gahhh.  Yes, obviously correct fix.  Thanks!  :-)
 
> -static struct fetch_pack_args args;
> +static struct fetch_pack_args args = {
> +	/* .uploadpack = */ "git-upload-pack",
> +};
>  
>  static const char fetch_pack_usage[] =
>  "git-fetch-pack [--all] [--quiet|-q] [--keep|-k] [--thin] [--upload-pack=<git-upload-pack>] [--depth=<n>] [--no-progress] [-v] [<host>:]<directory> [<refs>...]";
> -static const char *uploadpack = "git-upload-pack";
>  
>  #define COMPLETE	(1U << 0)
>  #define COMMON		(1U << 1)
> @@ -773,7 +774,7 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
>  			st.st_mtime = 0;
>  	}
>  
> -	pid = git_connect(fd, (char *)dest, uploadpack,
> +	pid = git_connect(fd, (char *)dest, args.uploadpack,
>                            args.verbose ? CONNECT_VERBOSE : 0);
>  	if (pid < 0)
>  		return NULL;

-- 
Shawn.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox