Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH 0/6] http-push updates
From: Nick Hengeveld @ 2006-03-14  0:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vek16udg6.fsf@assigned-by-dhcp.cox.net>

On Sun, Mar 12, 2006 at 09:21:45PM -0800, Junio C Hamano wrote:

> Repository maintenance tasks:
> 
>  - create a new repository
>  - remove an unneeded branch and tag
>  - running repack

In a DAV-only server environment, it seems like there are a few
options for supporting these tasks:

- extend http-push with additional args and/or local config settings.
  This approach would be more efficient wrt packs than separate
  push and repack steps since packs will all need to be created locally
  and then sent; a combined repack/push operation would mean that new
  objects will only be sent once as part of a pack.

- add DAV versions of git-init-db/git-branch/git-repack

- extend git-init-db/git-branch/git-repack to be DAV-aware

I like option #1.

>  - create new branch (and new tag) -- I think you can already do this

Right - you can create locally and then push that branch/tag or
--all/--tags.

>  - (perhaps) running update-server-info

http-push already updates info/refs if it existed before the push 
(perhaps that behavior should also be based on a local config setting.)
I would plan to add support for updating objects/info/packs along with
pack/repack support.  That should be all the server-info there is to
update, right?

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.

^ permalink raw reply

* Re: What should I use instead of git show?
From: Junio C Hamano @ 2006-03-13 23:55 UTC (permalink / raw)
  To: Mark Hollomon; +Cc: git
In-Reply-To: <4415FFB8.3000001@comcast.net>

Mark Hollomon <markhollomon@comcast.net> writes:

> I must be misunderstanding this:
>
> 	git whatchanged -p -1 HEAD
>
> in the current git tree results in nothing. only when I get to -5 does it show something.
>
> Is this expected?
>
>> git version
> git version 1.2.4.gea75

In this case what matterks is not the version of your git but
what that HEAD is.  If it is a merge commit, whatchanged -p does
not show anything by default.

^ permalink raw reply

* Re: What should I use instead of git show?
From: Mark Hollomon @ 2006-03-13 23:26 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.64.0603130830050.3618@g5.osdl.org>

Linus Torvalds wrote:
> 
> 
> 	git whatchanged -p -1 <sha1>
> 
> instead (actually, if your git is really old, you shouldn't use the modern 
> shorthand of "-1", you should use the longer "--max-count=1" instead).

I must be misunderstanding this:

	git whatchanged -p -1 HEAD

in the current git tree results in nothing. only when I get to -5 does it show something.

Is this expected?

 > git version
git version 1.2.4.gea75

-- 
Mark Hollomon

^ permalink raw reply

* make -d work in git-repack (without -a)
From: Alex Riesen @ 2006-03-13 22:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Signed-off-by: Alex Riesen <raa.lkml@gmail.com>

---

Junio C Hamano, Thu, Mar 09, 2006 19:50:43 +0100:
> I am inclined to say I prefer Alex' one.

I guess it has to be sent in formally...

 git-repack.sh |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

a594bee1d539f71970e321592f45a114ea648d92
diff --git a/git-repack.sh b/git-repack.sh
index bc90112..2f643b5 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -74,6 +74,8 @@ then
 			esac
 		  done
 		)
+	else
+		git-prune-packed
 	fi
 	git-prune-packed
 fi
-- 
1.2.4.g6ec337

^ permalink raw reply related

* Re: Possible --remove-empty bug
From: Linus Torvalds @ 2006-03-13 19:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Marco Costalba, git
In-Reply-To: <7v4q22ucio.fsf@assigned-by-dhcp.cox.net>

On Sun, 12 Mar 2006, Junio C Hamano wrote:
> 
> It removes the grandparents from the parent, and leaves the
> parent still interesting.  As a result, in your example:
> 
>     ... if you have
> 
>                 a
>                / \
>               b   c
>                \ /
>                 d
> 
>     where the pathname disappeared in "b"...
> 
> we would get this world view:
> 
>                 a
>                / \
>               b   c
>                  /
>                 d

Yeah, that's correct. That way you still see all the history that is 
relevant to the tree that became empty.

However, to be honest, the only reason to ever use --remove-empty is for 
rename detection, and Frederik's approach of doing that through the 
library interface directly is actually a much superior option. So we might 
as well drop the compilcation of --remove-empty entirely, unless somebody 
has already started using it.

The _real_ optimization would be to make the pathname based pruning be 
done incrementally instead of having to build up the whole tree. That 
would be much more important than the --remove-empty stuff from a 
usability standpoint. I'm absolutely sure it's possible, and I simplified 
the code earlier so that it should be simpler to do, but every time I 
actually look at the code I get confused again.

		Linus

^ permalink raw reply

* Re: What should I use instead of git show?
From: Olivier Galibert @ 2006-03-13 16:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Radoslaw Szkodzinski, git
In-Reply-To: <Pine.LNX.4.64.0603130830050.3618@g5.osdl.org>

On Mon, Mar 13, 2006 at 08:33:22AM -0800, Linus Torvalds wrote:
> Why not just use "git show"?
> 
> It hasn't gone anywhere that I know of. It's still there.

Oh beautiful.

I had an old (as in 3 months max, but heh) git-tree at work in a
corner from which I had compiled git.  When I saw it didn't have git
show which I have at home I did a git pull, recompile, reinstall.

The new git didn't have git show, so I thought it had been removed
after all.  Turns out, the git pull had broken halfway due to the old
version of git.  It hadn't fast forwarded _all_ the versions.  But the
new git, while not current, has been able to do the complete git pull
this time.  And now I have git show at work too.

Guess you have to update git every month or so if you want to be able
to follow current trees.

  OG.

^ permalink raw reply

* Re: What should I use instead of git show?
From: Olivier Galibert @ 2006-03-13 16:50 UTC (permalink / raw)
  To: Radoslaw Szkodzinski; +Cc: git
In-Reply-To: <200603131717.53416.astralstorm@o2.pl>

On Mon, Mar 13, 2006 at 05:17:47PM +0100, Radoslaw Szkodzinski wrote:
> On Monday 13 March 2006 15:47, Olivier Galibert wrote yet:
> > Since it seems to be gone.
> >
> > Up until now, when I wanted to send a patch to someone with the
> > associated changelog, I just did a git log to find the changelog sha1
> > then a git show to get the goods.  How am I supposed to do that now?
> 
> Why not use git-whatchanged? It's exacly meant to do this.

Indeed, git-whatchanged -p origin..HEAD worked.  Thanks.

> Or try qgit, or even gitk (which is what git show did).

gitk does not seem to have an export function.  Dunno about qgit.

  OG.

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Olivier Galibert @ 2006-03-13 16:37 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, git, Mark Wooding
In-Reply-To: <44151330.7020905@zytor.com>

On Sun, Mar 12, 2006 at 10:37:36PM -0800, H. Peter Anvin wrote:
> On "real" machines, the biggest reason you'd care is that a lot of 
> compilers, *especially* in C++ mode, really still define NULL as "0"; 
> ostensibly because defining it as "((void *)0)" breaks some obscure C++ 
> casting rule.

Not obscure, just a religious issue.  Somehow in the creation of the
C++ standard the definition of void * got changed from "generic
pointer" to something else I've been unable to fathom.  That
definition, whatever it is, justifies forbidding implicit casts from
void * to anything else.  Some of the priests of the new definition
consider the existence in C of a usable generic pointer type to be a
failing of the language too.

  OG.

^ permalink raw reply

* Re: What should I use instead of git show?
From: Linus Torvalds @ 2006-03-13 16:33 UTC (permalink / raw)
  To: Radoslaw Szkodzinski; +Cc: git, Olivier Galibert
In-Reply-To: <200603131717.53416.astralstorm@o2.pl>

On Mon, 13 Mar 2006, Radoslaw Szkodzinski wrote:
> On Monday 13 March 2006 15:47, Olivier Galibert wrote yet:
> > Since it seems to be gone.
> >
> > Up until now, when I wanted to send a patch to someone with the
> > associated changelog, I just did a git log to find the changelog sha1
> > then a git show to get the goods.  How am I supposed to do that now?
> >
> >   OG.
> 
> Why not use git-whatchanged? It's exacly meant to do this.
> Or try qgit, or even gitk (which is what git show did).

Why not just use "git show"?

It hasn't gone anywhere that I know of. It's still there.

Are you stuck with an older version of git that doesn't have it? If so, 
you can indeed do

	git whatchanged -p -1 <sha1>

instead (actually, if your git is really old, you shouldn't use the modern 
shorthand of "-1", you should use the longer "--max-count=1" instead).

		Linus

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Linus Torvalds @ 2006-03-13 16:26 UTC (permalink / raw)
  To: Horst von Brand; +Cc: git
In-Reply-To: <200603130414.k2D4EXcX011651@laptop11.inf.utfsm.cl>

On Mon, 13 Mar 2006, Horst von Brand wrote:
>> 
> > This breaks down with variadic functions, which have no typing
> > information. So doing this:
> >   execl("foo", "bar", my_struct_foo);
> > doesn't give the compiler a chance to do the implicit cast and you get
> > subtle breakage (in the same way that you would if you passed a long to
> > a variadic function expecting a short).
> 
> It just passes 3 "void *"'s, and casts back. What is so strange?

It doesn't actually pass 3 "void *".

Variadic functions pass their arguments as-is (apart from the normal 
integer promotion for small integer types: chars are passed as integers if 
"char" is smaller than "int").

So no actual casting takes place for the arguments, and

	execl("foo", "bar", my_struct_foo, NULL);

passes in four pointers: two of type "char *", one of whatever 
my_struct_foo is ("struct foo *") and finally hopefully one of "void *" 
(modulo broken compilers). They might - in theory - have different sizes 
and representations.

Now, the interesting effect of this is that simple things like

	printf("Pointer value %p\n", myfunction);

is actually NOT STRICTLY PORTABLE CODE! Why? Because "%p" wants a void 
pointer, but "myfunction" is of a different pointer type, which may 
actually have a different size and a different representation entirely, so 
you can get total garbage printed out.

So you have two choices:

 - be sane, and ignore insane architectures. In this case, the above is 
   perfectly fine C code.

 - be insane, and care about it. In this case, you really do have to add 
   the casts to be safe in theory.

The "%p" for printf() is actually a wonderful example of why you really 
really really should ignore language lawyers. According to language 
lawyers, you should add that "(void *)" cast. But look around for how many 
such casts you can find in real code, and realize that the language 
lawyers just don't matter. A C compiler environment that requires it is 
simply broken, and sane people will refuse to use it for anything than 
small embedded work, because it's simply not usable.

So while it's not true in theory, in _practice_ you should expect all 
pointers to be of the same size and use the same representation. Anything 
else is just too painful to be ever worth bothering with.

(Remember near and far pointers from 16-bit DOS/Windows? Those 
environments at least had _explicit_ pointer sizes, making the problem 
less horrible, but that was clearly a disaster nonetheless. Pointers that 
implicitly have different sizes because they point to different types are 
even _worse_).

			Linus

PS. Final words: while knowing that all pointer representations must be 
the same, and that NULL actually must be the bitwise binary "all zero" 
value (otherwise "memset(structptr, 0, structsize)" would break), it's 
equally important to know that the standard doesn't _guarantee_ it. Why? 

Because that way, when somebody tries to tell you that your code isn't 
standards conformant and might be unportable, you can say "yeah, I know, 
but only a total loser would ever care". Instead of incorrectly trying to 
argue that your code is "correct".

^ permalink raw reply

* Re: What should I use instead of git show?
From: Radoslaw Szkodzinski @ 2006-03-13 16:17 UTC (permalink / raw)
  To: git; +Cc: Olivier Galibert
In-Reply-To: <20060313144747.GA81092@dspnet.fr.eu.org>

[-- Attachment #1: Type: text/plain, Size: 530 bytes --]

On Monday 13 March 2006 15:47, Olivier Galibert wrote yet:
> Since it seems to be gone.
>
> Up until now, when I wanted to send a patch to someone with the
> associated changelog, I just did a git log to find the changelog sha1
> then a git show to get the goods.  How am I supposed to do that now?
>
>   OG.

Why not use git-whatchanged? It's exacly meant to do this.
Or try qgit, or even gitk (which is what git show did).

--
GPG Key id:  0xD1F10BA2
Fingerprint: 96E2 304A B9C4 949A 10A0  9105 9543 0453 D1F1 0BA2

AstralStorm

[-- Attachment #2: Type: application/pgp-signature, Size: 191 bytes --]

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Horst von Brand @ 2006-03-12 21:51 UTC (permalink / raw)
  To: Mark Wooding; +Cc: git
In-Reply-To: <slrne18of5.fr9.mdw@metalzone.distorted.org.uk>

Mark Wooding <mdw@distorted.org.uk> wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
> > So in modern C, using NULL at the end of a varargs array as a pointer is 
> > perfectly sane, and the extra cast is just ugly and bowing to bad 
> > programming practices and makes no sense to anybody who never saw the 
> > horror that is K&R.

> No!  You can still get bitten.

Only if the compiler is completely broken.

>                                 You're lucky that on common platforms
> all pointers look the same, but if you find one where `char *' (and
> hence `void *') isn't the same as `struct foo *' then, under appropriate
> circumstances you /will/ unless you put the casts in.

Show one platform where this is true...

> Now, maybe we don't care for GIT.  That's your (and Junio's) call.  My
> natural approach is to work as closely as I can to the specs (and then
> throw in hacks for platforms which /still/ don't work), though, which is
> why I brought the subject up.

If on the platform you use it doesn't work, change the compiler. Fast.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply

* Re: [PATCH 1/4] Simplify wildcards for match files to be ignored
From: Horst von Brand @ 2006-03-11  0:08 UTC (permalink / raw)
  To: Jonas Fonseca; +Cc: Petr Baudis, git
In-Reply-To: <20060310144308.GB7920@diku.dk>

Jonas Fonseca <fonseca@diku.dk> wrote:
> Signed-off-by: Jonas Fonseca <fonseca@diku.dk>
> 
> ---
> 
>  Documentation/Makefile |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 3aad2fb..661c259 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,4 +1,4 @@
> -CG_IGNORE=$(wildcard ../cg-X* ../cg-*.orig ../cg-*.rej ../cg-*.in)
> +CG_IGNORE=$(wildcard ../cg-X* ../cg-*.*)

Nope. Better be specific in what you delete. It is not exactly
performance-critical...
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply

* Re: [PATCH] Use explicit pointers for execl...() sentinels.
From: Horst von Brand @ 2006-03-13  4:12 UTC (permalink / raw)
  To: git
In-Reply-To: <20060313033121.GA14601@coredump.intra.peff.net>

Jeff King <peff@peff.net> wrote:
> On Sun, Mar 12, 2006 at 08:08:12PM +0200, Timo Hirvonen wrote:
> > NULL pointer does not point to any data, it just says it's 'empty'.  So
> > it doesn't need to be same type pointer as specified in the function
> > prototype.  Pointers are just addresses, it doesn't matter from to code
> > generation point of view whether it is (char *)0 or (void *)0.

> Sorry, but I think you're wrong according to the C standard. Pointers of 
> different types do NOT have to share the same representation (e.g.,
> there have been some platforms where char* and int* were different
> sizes). A void pointer must be capable of representing any type of
> pointer (for example, holding the largest possible type). However, if
> sizeof(void *) == 8 and sizeof(char *) == 4,

Very improbable, they'll be the same normally ("void *" is a way of getting
rid of the overloading of the meaning of "char *" for this before ANSI C).
Sure, sizeof(int *) might be 4, but I think that is pretty far off.

>                                              you have a problem with
> variadic functions which are expecting to pull 4 byte off the stack. 

There are special rules for variadic functions, probably pointers would be
cast to/from void * in such a case by the compiler.

> In a non-variadic function, the compiler would do the right implicit
> casting. In a variadic function, it can't. 

It sure can. The rules where defined so that it works.

> The real question is, does git want to care about portability to such
> platforms.

Broken platform, on which the compiler fails miserably in doing its job?
No, it doesn't.

> If you remain unconvinced, I can try to find chapter and verse of the
> standard.

Please do.

> > sizeof(unsigned long) is sizeof(void *) in real world.

> Are you saying that because it encompasses all of the platforms you've
> worked on, or do you have some evidence that it is largely the case? It
> certainly isn't guaranteed by the C standard.

More because a machine with pointers that are much larger than the largest
"normal" integer would be pretty weird (sure, on intel 8086 they where 32
("far" pointer, segment + offset) and 16 bits, but...
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Horst von Brand @ 2006-03-13  4:14 UTC (permalink / raw)
  To: git
In-Reply-To: <20060313033805.GB14601@coredump.intra.peff.net>

Jeff King <peff@peff.net> wrote:

[...]

> A void pointer is guaranteed to be able to hold any type of pointer
> (either char * or struct foo * or whatever). The declaration of malloc
> indicates a return of void *. On a platform where it matters, the
> compiler generates code so that 
>   struct foo *bar = malloc(100);
> converts the void * pointer into the correct size (in the same way that
> assigning between differently sized integers works).

Right.

> This breaks down with variadic functions, which have no typing
> information. So doing this:
>   execl("foo", "bar", my_struct_foo);
> doesn't give the compiler a chance to do the implicit cast and you get
> subtle breakage (in the same way that you would if you passed a long to
> a variadic function expecting a short).

It just passes 3 "void *"'s, and casts back. What is so strange?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Linus Torvalds @ 2006-03-13 15:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vzmjupqv0.fsf@assigned-by-dhcp.cox.net>

On Mon, 13 Mar 2006, Junio C Hamano wrote:
> 
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.

Yes. However, there are better ways to do that in practice.

The most effective way that is generally used is to not use a fixed 
chunk-size, but use a terminating character, together with a 
minimum/maximum chunksize.

There's a pretty natural terminating character that works well for 
sources: '\n'.

So the natural way to do similarity detection when most of the code is 
line-based is to do the hashing on chunks that follow the rule "minimum of 
<n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".

So if you don't see any '\n' at all (or the only such one is less than <n> 
bytes into your current window), do the hash over a <2n>-byte chunk (this 
takes care of binaries and/or long lines).

This - for source code - allows you to ignore trivial byte offset things, 
because you have a character that is used for synchronization. So you 
don't need to do hashing at every byte in both files - you end up doing 
the hashing only at line boundaries in practice. And it still _works_ for 
binary files, although you effectively need bigger identical chunk-sizes 
to find similarities (for text-files, it finds similarities of size <n>, 
for binaries the similarities need to effectively be of size 3*n, because 
you chunk it up at ~2*n, and only generate the hash at certain offsets in 
the source binary).

		Linus

^ permalink raw reply

* What should I use instead of git show?
From: Olivier Galibert @ 2006-03-13 14:47 UTC (permalink / raw)
  To: git

Since it seems to be gone.

Up until now, when I wanted to send a patch to someone with the
associated changelog, I just did a git log to find the changelog sha1
then a git show to get the goods.  How am I supposed to do that now?

  OG.

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Junio C Hamano @ 2006-03-13 10:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0603122316160.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Instead of doing a fixed-chunk thing and saying that any copy is 
> equivalent to any other copy. That's simply not true. It's _much_ better 
> to have one 24-byte copy than it is to have three 8-byte copies, but the 
> new faster diffcore-delta.c just can't see that.

Exactly.

You know what?  Once we start counting to detect 24-byte
straight copy and try distinguishing it from 3 separate 8-byte
copies, it will eventually lead us to what we have in
diff-delta.c anyway.  I avoided counting runs of bytes on
purpose because I wanted to see how far we can go without it.

The primary reason I started the jc/diff topic branch was
because we _might_ want to replace what is in the current
diff-delta.c with much finer-grained comparison code, and when
that happens, counting xdelta output for rename detection
purpose would have stopped making sense.  For now we decided to
postpone it for performance reasons, but we still might want to
when Nico comes back with a better implementation.

Now, I know the current diff-delta based similarity estimator we
have in "main" seems to do a reasonable if not perfect job,
within a reasonabe amount of time.  And it does know how to
count copying of consecutive bytes.  In the worst case we could
just fork the xdelta part of the code when Nico comes back with
improved finer-grained delta, and we can keep using the current
diff-delta code for rename detection.  Knowing we have that
fallback position, I wanted to pursue a different avenue.
Distinguishing a straight 24-byte run from three independent
8-byte run, using hash to find the offset in the source and
actually do maximum string match, is something we already know
how to do, because that is essentially what the current
diff-delta code does.

By the way, the reason the diffcore-delta code in "next" does
not do every-eight-bytes hash on the source material is to
somewhat alleviate the problem that comes from not detecting
copying of consecutive byte ranges.  If you have a 8-byte run
that is copied from source to destination, we would give it one
point (let's for now forget about false match coming from hash
collisions).  Since the source material is hashed at every byte
offset, if we have 9-byte run copied from source to destination,
that is awarded two points (for the first 8-byte we award one
point, and then another 8-byte sequence starting from the second
byte we award another point; we are talking about an overlapping
range).  That way, the code does reward copying consecutive
bytes around more heavily than copying things at random places.
At one extreme, if you copy 7-byte, throw in a garbage, another
7-byte, throw in a garbage, and keep going, you would not get
any point.

It's really a funky heuristics, and as you have seen, it
sometimes gives spectaculary phony matches.  But in practice,
with some tweaking it seems to do an OK job.

^ permalink raw reply

* Direct CVS import tool
From: Keith Packard @ 2006-03-13  8:25 UTC (permalink / raw)
  To: Git Mailing List; +Cc: keithp

[-- Attachment #1: Type: text/plain, Size: 1401 bytes --]

I've got some rather large, broken, CVS trees that I'm trying to migrate
to git (X.org).

Attempts to do this with the existing cvsps-based git-cvsimport have
proved rather disasterous; missing files, incorrect log messages and
incorrect revisions are spread throughout the tree, including on the tip
of each branch.

I'm reasonably sure the problems are caused by cvsps, and while I've
hacked at that quite a bit, it seems like it's so focused on analysing
the tree for putative software engineering reesarch that it cannot be
made to accurately reproduce the tree via changeset analysis.

So, I've given up hacking cvsps and wrote a simple RCS file parser that
directly reads ,v files into a git-like revlist structure, and then
merges those together into a final tree. At this point, it's generating
the right head information for every named branch, but it's still not
successfully connecting all of the branches back together at suitable
points.

It's also not yet generating actual git trees; instead it dumps the
generated revision tree structure to a graphviz file for visual
inspection of the results.

But, I figured instead of doing this work in secret, I'd let people know
what I'm up to in case others want to play along.

git://git.freedesktop.org/~keithp/parsecvs

And, of course ideas for a suitable name would be welcome.

-- 
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Linus Torvalds @ 2006-03-13  7:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0603122256550.3618@g5.osdl.org>

On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> Now, that said, they _both_ find some pretty funky renames. I think there 
> is probably some serious room for improvement, regardless (or at least 
> changing the default similarity cut-off to something better ;)

I'm afraid that _good_ rename detection really ends up wanting to take 
"longest possible sequence" into account, exactly like the full xdelta 
does. 

Instead of doing a fixed-chunk thing and saying that any copy is 
equivalent to any other copy. That's simply not true. It's _much_ better 
to have one 24-byte copy than it is to have three 8-byte copies, but the 
new faster diffcore-delta.c just can't see that.

So one big reason as to why it is fast in the first place is that it 
fundamentally just doesn't do a very good job ;(

It might be that the fast delta thing is a good way to ask "is this even 
worth considering", to cut down the O(m*n) rename/copy detection to 
something much smaller, and then use xdelta() to actually figure out what 
is a good rename and what isn't from a much smaller set of potential 
targets.

That would actually allow us to be even _less_ precise. Screw that big 
hash-table etc, don't even try to be exact. Just try to be fairly fast, 
and then pick the top entries from the similarity array for more precise 
diffing if there are multiple choices that look like they might be 
possible.

The appended alternate "diffcore-delta.c" doesn't do any of the caching 
(ie I wrote it so that it would be easy to change to make the _caller_ 
keeps "src" constant, and iterates over destination - or the other way 
around - and would do the hash setup just once per src).

Still, even with the existing setup, it's pretty fast for me (not much 
slower than your caching version even though it recalculates everything 
every time). And it's not that far off, which tells me that if it was used 
as a "first-pass filter", we could afford to do a better job on the things 
that it says are likely candidates.

Hmm? It really does bother me how the suggested rename detector finds 
stuff that clearly isn't. 

			Linus

----
#include "cache.h"
#include "diff.h"
#include "diffcore.h"

#define CHUNK (16)
#define SILLYSIZE (65537)
static int hashnr[SILLYSIZE];

static void setup_hash(void)
{
	memset(hashnr, 0, sizeof(hashnr));
}

static void insert_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	hashnr[hashval]++;
}

static int find_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	if (hashnr[hashval]) {
		hashnr[hashval]--;
		return 1;
	}
	return 0;
}

int diffcore_count_changes(void *src, unsigned long src_size,
			   void *dst, unsigned long dst_size,
			   void **src_count_p,
			   void **dst_count_p,
			   unsigned long delta_limit,
			   unsigned long *src_copied,
			   unsigned long *literal_added)
{
	unsigned long copied = 0;
	unsigned long literal = 0;

	setup_hash();
	while (src_size >= CHUNK) {
		unsigned int hashval = adler32(0, src, CHUNK);
		insert_hash(hashval);
		src += CHUNK;
		src_size -= CHUNK;
	}

	while (dst_size >= CHUNK) {
		unsigned int hashval = adler32(0, dst, CHUNK);
		if (find_hash(hashval)) {
			copied += CHUNK;
			dst += CHUNK;
			dst_size -= CHUNK;
			continue;
		}
		literal++;
		if (literal > delta_limit)
			return -1;
		dst++;
		dst_size--;
	}
	literal += dst_size;

	*src_copied = copied;
	*literal_added = literal;
	return 0;
}

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Junio C Hamano @ 2006-03-13  7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0603122256550.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Just compare the result...
>
> Now, that said, they _both_ find some pretty funky renames. I think there 
> is probably some serious room for improvement, regardless (or at least 
> changing the default similarity cut-off to something better ;)

Yes.  The "compare with larger" seems to cull nonsensical ones
found by "next" one much better.

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Linus Torvalds @ 2006-03-13  7:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List
In-Reply-To: <7vmzfusuyq.fsf@assigned-by-dhcp.cox.net>



On Sun, 12 Mar 2006, Junio C Hamano wrote:
>
> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > The "score" calculation for diffcore-rename was totally broken.
> >
> > It scaled "score" as
> >
> > 	score = src_copied * MAX_SCORE / dst->size;
> >
> > which means that you got a 100% similarity score even if src and dest were 
> > different, if just every byte of dst was copied from src, even if source 
> > was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> > the remaining 15%).
> 
> Your reading of the code is correct, but that is deliberate.
> 
> >  	/* How similar are they?
> >  	 * what percentage of material in dst are from source?
> >  	 */
> 
> I wanted to say in such a case that dst was _really_ derived
> from the source.  I think using max may make more sense, but I
> need to convince myself by looking at filepairs that this change
> stops detecting as renames, and this change starts detecting as
> renames.

Just compare the result. Just eye-balling the difference between the 
rename data from 2.6.12 to 2.6.14, the fixed score actually gets better 
rename detection. It actually finds 133 renames (as opposed to 132 for the 
broken one), and the renames it finds are more sensible.

For example, the fixed version finds

	drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h

which actually matches the other i2c/chips/ renames, while the broken one 
does

	drivers/i2c/chips/lm75.h -> drivers/media/video/rds.h

which just doesn't make any sense at all.

Now, that said, they _both_ find some pretty funky renames. I think there 
is probably some serious room for improvement, regardless (or at least 
changing the default similarity cut-off to something better ;)

		Linus

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Junio C Hamano @ 2006-03-13  6:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0603122223160.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> The "score" calculation for diffcore-rename was totally broken.
>
> It scaled "score" as
>
> 	score = src_copied * MAX_SCORE / dst->size;
>
> which means that you got a 100% similarity score even if src and dest were 
> different, if just every byte of dst was copied from src, even if source 
> was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> the remaining 15%).

Your reading of the code is correct, but that is deliberate.

>  	/* How similar are they?
>  	 * what percentage of material in dst are from source?
>  	 */

I wanted to say in such a case that dst was _really_ derived
from the source.  I think using max may make more sense, but I
need to convince myself by looking at filepairs that this change
stops detecting as renames, and this change starts detecting as
renames.

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Linus Torvalds @ 2006-03-13  6:46 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: git, Mark Wooding
In-Reply-To: <44151330.7020905@zytor.com>

On Sun, 12 Mar 2006, H. Peter Anvin wrote:
>
> On "real" machines, the biggest reason you'd care is that a lot of compilers,
> *especially* in C++ mode, really still define NULL as "0"; ostensibly because
> defining it as "((void *)0)" breaks some obscure C++ casting rule.

Agreed. gcc has fixed that rule, but others have not. Don't compile git 
as C++.

		Linus

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Linus Torvalds @ 2006-03-13  6:44 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0603122223160.3618@g5.osdl.org>

On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> The "score" calculation for diffcore-rename was totally broken.
> 
> It scaled "score" as
> 
> 	score = src_copied * MAX_SCORE / dst->size;
> 
> which means that you got a 100% similarity score even if src and dest were 
> different, if just every byte of dst was copied from src, even if source 
> was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> the remaining 15%).
> 
> That's clearly bogus. We should do the score calculation relative not to 
> the destination size, but to the max size of the two.
> 
> This seems to fix it.

Btw, interestingly, this seems to actually improve on the rename 
detection from your previous one, even though at the face of it, it 
should just have made the scores go down.

I'm not quite sure why, but perhaps it gave a bogus high score to some 
rename that wasn't very good, allowing the _real_ rename to make itself 
seen.

Or maybe I did some mistake in testing it.

		Linus

PS. You can still get a "similarity score" of 100 with the fixed scaling 
even if the source and the destination were different. That happens if 
every byte was marked as "copied" by the similarity estimator. Which can 
happen if you just move things around in the file - the end result is 
different, but all the bytes are copied from the source.

At least with the fixed heuristic, that "perfect similarity" score can be 
_somehow_ be explained. The files are very similar in that they have the 
same content, just in a different order ;)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox