Git development
* Re: [RFH] shifting xdiff hunks?
From: Junio C Hamano @ 2006-04-13 23:31 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604131452250.10564@alien.or.mcafeemobile.com>

Davide Libenzi <davidel@xmailserver.org> writes:

> On Wed, 12 Apr 2006, Davide Libenzi wrote:
>
>> Yes, this is what GNU diff does. It's a post-process of the edit
>> script. Not a problem at all. Till this weekend (included) I'm
>> pretty booked, but I'll do that in the following days.
>
> Dang, that was a short weekend. I found a lunch-time hour for
> this. Would you try to see if this libxdiff-based diff merges on your
> tree?
> See also how it looks for you.

Very impressed, and pleased with the result.  I've only taken a
cursory look, but with a very limited number of tests, it looks
much better.  Thanks.

For the sake of full disclosure, the reason I wanted consistency
was not for the diff output I quoted earlier, but to help make
the combined patch output cleaner.  It does reduce false matches
from the infamous 12-way Octopus by Len Brown:

	git diff-tree --cc 9fdb62af92c741addbea15545f214a6e89460865


* Re: [PATCH] Shell utilities: Guard against expr' magic tokens.
From: Junio C Hamano @ 2006-04-13 23:39 UTC (permalink / raw)
  To: Mark Wooding; +Cc: git
In-Reply-To: <slrne3tihk.1dq.mdw@metalzone.distorted.org.uk>

Mark Wooding <mdw@distorted.org.uk> writes:

> From: Mark Wooding <mdw@distorted.org.uk>
>
> Some words, e.g., `match', are special to expr(1), and cause strange
> parsing effects.  Track down all uses of expr and mangle the arguments
> so that this isn't a problem.

Gaaaaaaaaaah.  

    http://www.opengroup.org/onlinepubs/009695399/utilities/expr.html

says use of length, substr, index, match as string arguments
produces unspecified results, so obviously the program was
wrong.

Thanks.
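
A minimal sketch of the guard being discussed, assuming only a POSIX
expr: prefixing both the string and the pattern with a literal "z"
means the string operand can never be one of the special words, so
expr has to treat it as data. The loop and the variable names are
illustrative and not taken from the patch.

```shell
# Each of these words is special to some expr implementations when it
# appears on its own as an operand.
for word in length substr index match
do
	# Guarded form: "z$word" is never a keyword, and the pattern
	# strips the z again inside the capture group.
	got=$(expr "z$word" : 'z\(.*\)')
	echo "$word -> $got"
done
```

Each iteration should echo the word back unchanged, whatever expr
would have done with the bare word.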


* [PATCH] Fix-up previous expr changes.
From: Junio C Hamano @ 2006-04-14  2:12 UTC (permalink / raw)
  To: git; +Cc: Mark Wooding
In-Reply-To: <slrne3tihk.1dq.mdw@metalzone.distorted.org.uk>

The regexp on the right hand side of expr : operator somehow was
broken.

	expr 'z+pu:refs/tags/ko-pu' : 'z\+\(.*\)'

does not strip '+'; write 'z+\(.*\)' instead.

We probably should switch to shell based substring post 1.3.0;
that's not bashism but just POSIX anyway.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * Funny thing is that before the z prefixing, the code was
   already broken (we said expr "$ref" : '\+\(.*\)'), but
   somehow it worked.  It could be a bug in expr.

	# already buggy but did not trigger somehow.
        : siamese; expr '+pu:ko-pu' : '\+\(.*\)'
        pu:ko-pu
        # z prefix exposed the breakage.
        : siamese; expr 'z+pu:ko-pu' : 'z\+\(.*\)'
        +pu:ko-pu
        # the fix-up this patch is about.
        : siamese; expr 'z+pu:ko-pu' : 'z+\(.*\)'
        pu:ko-pu
        # this is the way it should have been written from the start.
        : siamese; expr '+pu:ko-pu' : '+\(.*\)'
        pu:ko-pu
        # maybe I am using broken expr...
        : siamese; type expr
        expr is hashed (/usr/bin/expr)
        : siamese; /usr/bin/expr --version |head -n2
        expr (GNU coreutils) 5.94
        Copyright (C) 2006 Free Software Foundation, Inc.

 git-fetch.sh        |    4 ++--
 git-parse-remote.sh |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

dfdcb558ecf93c0e09b8dab89cff4839e8c95e36
diff --git a/git-fetch.sh b/git-fetch.sh
index 711650f..83143f8 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -252,10 +252,10 @@ fetch_main () {
       else
 	  not_for_merge=
       fi
-      if expr "z$ref" : 'z\+' >/dev/null
+      if expr "z$ref" : 'z+' >/dev/null
       then
 	  single_force=t
-	  ref=$(expr "z$ref" : 'z\+\(.*\)')
+	  ref=$(expr "z$ref" : 'z+\(.*\)')
       else
 	  single_force=
       fi
diff --git a/git-parse-remote.sh b/git-parse-remote.sh
index 65c66d5..c9b899e 100755
--- a/git-parse-remote.sh
+++ b/git-parse-remote.sh
@@ -77,7 +77,7 @@ canon_refs_list_for_fetch () {
 		force=
 		case "$ref" in
 		+*)
-			ref=$(expr "z$ref" : 'z\+\(.*\)')
+			ref=$(expr "z$ref" : 'z+\(.*\)')
 			force=+
 			;;
 		esac
-- 
1.3.0.rc3.gce03
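
A possible shape for the "shell based substring" mentioned in the
message above, shown next to the fixed expr call so the two can be
compared. This is a sketch, not the code that went into git; the
sample ref value is the one from the message. For background, GNU
basic regular expressions treat "\+" after an atom as a repetition
operator, which is consistent with the 'z\+\(.*\)' result shown
above.

```shell
ref='+pu:refs/tags/ko-pu'

# The fixed expr form from the patch: a bare + is a literal in a
# basic regular expression.
via_expr=$(expr "z$ref" : 'z+\(.*\)')

# POSIX parameter expansion: no regexp and no subprocess.
case "$ref" in
+*)
	single_force=t
	ref=${ref#+}	# drop one leading +
	;;
*)
	single_force=
	;;
esac

echo "expr:      $via_expr"
echo "expansion: $ref"
```

Both lines should show the ref with the leading "+" removed.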


* Solaris test t5500 race condition
From: Peter Eriksen @ 2006-04-14  3:17 UTC (permalink / raw)
  To: git

Hello,

I've found a race in t5500-fetch-pack.sh.  The problem is the way the
number of unpacked objects is counted:

    pack_count=$(grep Unpacking log.txt|tr -dc "0-9")

It just concatenates all the digits on the line with "Unpacking" in it. 
This is the output I get on Solaris:

    Generating pack...
    Done counting 3 objects.
    Deltifying 3 objects.
      33% (1/3) done^M  66% (2/3) done^M 100% (3/3) done
    Total 3Unpacking , written 33 objects          <------------
     (delta 0), reused 0 (delta 0)
    11fa2f0cb58ed7f02dbd5ac75ed82a53fae62a7b refs/heads/A

The marked line is written as a joyful duet between these
two functions:

    unpack-objects.c:   fprintf(stderr, "Unpacking %d objects\n",
                                nr_objects);

    pack-objects.c:     fprintf(stderr, "Total %d, written %d 
                                (delta %d), reused %d (delta %d)\n",

I can't think of a good solution right now.

Regards,

Peter
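
To make the failure concrete, here is a small sketch that feeds the
test's own extraction pipeline first a clean "Unpacking" line and
then the interleaved line from the Solaris log above. The helper name
count_from_log is made up for this illustration.

```shell
# The extraction used by the test, reading the log on stdin.
count_from_log () {
	grep Unpacking | tr -dc "0-9"
}

clean=$(printf 'Unpacking 3 objects\n' | count_from_log)
mixed=$(printf 'Total 3Unpacking , written 33 objects\n' | count_from_log)

echo "clean log:       $clean"
echo "interleaved log: $mixed"
```

With the clean line the result is 3; with the interleaved line every
digit on the line is concatenated, giving 333.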


* [PATCH] diff --stat: no need to ask funcnames nor context.
From: Junio C Hamano @ 2006-04-14  4:37 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <Pine.LNX.4.63.0604140012560.10924@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Now, you can say "git diff --stat" (to get an idea how many changes are
> uncommitted), or "git log --stat".
>     
> Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>

Nice.

-- >8 --
Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 diff.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

84981f9ad963f050abf4fe33ac07d36b4ea90c6d
diff --git a/diff.c b/diff.c
index c120239..f1b672d 100644
--- a/diff.c
+++ b/diff.c
@@ -438,8 +438,8 @@ static void builtin_diffstat(const char 
 		xdemitcb_t ecb;
 
 		xpp.flags = XDF_NEED_MINIMAL;
-		xecfg.ctxlen = 3;
-		xecfg.flags = XDL_EMIT_FUNCNAMES;
+		xecfg.ctxlen = 0;
+		xecfg.flags = 0;
 		ecb.outf = xdiff_outf;
 		ecb.priv = diffstat;
 		xdl_diff(&mf1, &mf2, &xpp, &xecfg, &ecb);
-- 
1.3.0.rc3.g9306


* Re: Solaris test t5500 race condition
From: Jason Riedy @ 2006-04-14  5:03 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060414031759.GA9524@bohr.gbar.dtu.dk>

And "Peter Eriksen" writes:
 - I've found a race in t5500-fetch-pack.sh.

Crap.  I ran into this on AIX a while ago; I was hoping no
other systems would see it.  There are no guarantees that 
the two processes' outputs will be mutually line buffered.
Luckily, it's just a cosmetic problem, but it does cause 
that test case to fail.

I know how to fix it (imho), but have no time to implement
it.  There needs to be a separate communication stage after 
negotiating the objects and before dumping the pack.  During
that stage, upload-pack would just send progress notices to 
the caller.  Only the caller would communicate to the terminal.
Some other ideas are in
  http://marc.theaimsgroup.com/?l=git&m=114357528512063&w=2

Jason


* Re: Solaris test t5500 race condition
From: Junio C Hamano @ 2006-04-14  5:34 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060414031759.GA9524@bohr.gbar.dtu.dk>

"Peter Eriksen" <s022018@student.dtu.dk> writes:

>     Generating pack...
>     Done counting 3 objects.
>     Deltifying 3 objects.
>       33% (1/3) done^M  66% (2/3) done^M 100% (3/3) done
>     Total 3Unpacking , written 33 objects          <------------
>      (delta 0), reused 0 (delta 0)
>     11fa2f0cb58ed7f02dbd5ac75ed82a53fae62a7b refs/heads/A

Hmph.  Not good.  Before the writer managed to flush the report
the reader has already decoded the header and reports the number
of objects it is going to unpack.

Unfortunately the Solaris box I have access to is perhaps
sufficiently slow that this is not an issue X-<.

I think test based on the eye-candy is fragile anyway.  We would
want to probably _count_ before and after to see if the command
did what we expected.

There is a subtle difficulty doing so, however.  The test is
trying to see if fetch-pack vs upload-pack negotiations result
in minimal transfer, but if it is not, unpack side would just
happily say "I received this one, oh, I already have it".

We could do "fetch-pack -k" to keep the result packed, count the
number of objects in the resulting pack.

How about doing something like this instead?

-- >8 --
[PATCH] t5500: test fix

Relying on eye-candy progress bar was fragile to begin with.
Run fetch-pack with -k option, and count the objects that are in
the pack that were transferred from the other end.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 t/t5500-fetch-pack.sh |   33 ++++++++++++++-------------------
 1 files changed, 14 insertions(+), 19 deletions(-)

7f732c632ff7a1adc2309257becdc0c1fe76b514
diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index e15e14f..92f12d9 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -12,11 +12,6 @@ # Test fetch-pack/upload-pack pair.
 
 # Some convenience functions
 
-function show_count () {
-	commit_count=$(($commit_count+1))
-	printf "      %d\r" $commit_count
-}
-
 function add () {
 	local name=$1
 	local text="$@"
@@ -55,13 +50,6 @@ function test_expect_object_count () {
 		"test $count = $output"
 }
 
-function test_repack () {
-	local rep=$1
-
-	test_expect_success "repack && prune-packed in $rep" \
-		'(git-repack && git-prune-packed)2>>log.txt'
-}
-
 function pull_to_client () {
 	local number=$1
 	local heads=$2
@@ -70,13 +58,23 @@ function pull_to_client () {
 
 	cd client
 	test_expect_success "$number pull" \
-		"git-fetch-pack -v .. $heads > log.txt 2>&1"
+		"git-fetch-pack -k -v .. $heads"
 	case "$heads" in *A*) echo $ATIP > .git/refs/heads/A;; esac
 	case "$heads" in *B*) echo $BTIP > .git/refs/heads/B;; esac
 	git-symbolic-ref HEAD refs/heads/${heads:0:1}
+
 	test_expect_success "fsck" 'git-fsck-objects --full > fsck.txt 2>&1'
-	test_expect_object_count "after $number pull" $count
-	pack_count=$(grep Unpacking log.txt|tr -dc "0-9")
+
+	test_expect_success 'check downloaded results' \
+	'mv .git/objects/pack/pack-* . &&
+	 p=`ls -1 pack-*.pack` &&
+	 git-unpack-objects <$p &&
+	 git-fsck-objects --full'
+
+	test_expect_success "new object count after $number pull" \
+	'idx=`echo pack-*.idx` &&
+	 pack_count=`git-show-index <$idx | wc -l` &&
+	 test $pack_count = $count'
 	test -z "$pack_count" && pack_count=0
 	if [ -z "$no_strict_count_check" ]; then
 		test_expect_success "minimal count" "test $count = $pack_count"
@@ -84,6 +82,7 @@ function pull_to_client () {
 		test $count != $pack_count && \
 			echo "WARNING: $pack_count objects transmitted, only $count of which were needed"
 	fi
+	rm -f pack-*
 	cd ..
 }
 
@@ -117,8 +116,6 @@ git-symbolic-ref HEAD refs/heads/B
 
 pull_to_client 1st "B A" $((11*3))
 
-(cd client; test_repack client)
-
 add A11 $A10
 
 prev=1; cur=2; while [ $cur -le 65 ]; do
@@ -129,8 +126,6 @@ done
 
 pull_to_client 2nd "B" $((64*3))
 
-(cd client; test_repack client)
-
 pull_to_client 3rd "A" $((1*3)) # old fails
 
 test_done
-- 
1.3.0.rc3.g9306


* What's in git.git
From: Junio C Hamano @ 2006-04-14  7:49 UTC (permalink / raw)
  To: git

Getting closer with a bunch of fixes, perhaps a real 1.3.0 early
next week.

I'd appreciate people beating what's in the "master" branch to
shake down the last minute brown paper bag problems.  

BTW, I shifted my git day from usual Wednesday to Thursday this
week.  I may do the same the next week.

* The 'master' branch has these since the last announcement.

 - More Solaris 9 portability (Dennis Stosberg)
 - kill index() and replace it with strchr() (Dennis Stosberg)
 - git-apply -C to apply patch with fuzz (Eric W. Biederman)
 - git-log [diff options]
 - Retire git-log.sh
 - Combine-diff fix
 - Retire t5501 test
 - Fix "echo -n foo | git commit -F -"
 - diff --patch-with-raw (Pasky and me)
 - Documentation updates (Pasky and me)
 - Fix running t3600 test as root.
 - "expr match : foobar" fix (Mark Wooding and me)
 - commit message formatting fix for incomplete line (Linus)
 - git-log memory footprint fix (Linus)

* The 'next' branch, in addition, has these.

 - xdiff: post-process hunks to make them consistent (Davide Libenzi)
 - diff --stat (Johannes Schindelin and me)
 - t5500 test fix


* Recent unresolved issues
From: Junio C Hamano @ 2006-04-14  9:31 UTC (permalink / raw)
  To: git

Here is a list of topics in the recent git traffic that I feel
inadequately addressed.  I've commented on some of them to give
people a feel for what my priorities are.  Somebody might want
to rehash the ones low on my priority list to conclusion with a
concrete proposal if they cared about them enough.  The list is
*not* ordered in any way.

Also please add whatever I missed (or dismissed).  I am hoping
this will be a good basis for 1.4 to-do list.

* Message-ID: <Pine.LNX.4.64.0604121828370.14565@g5.osdl.org>
  Common option parsing (Linus Torvalds)

* Message-ID: <Pine.LNX.4.64.0604050855080.2550@localhost.localdomain>
  Binary diff output? (Nicolas Pitre)

  I do not think this is needed for our primary audience (the
  kernel project), but I am sure it would be helpful for some
  other projects if we allowed them to exchange patches that
  describe binary file changes via e-mail, so I am not
  dismissing this.  Needs to wait "option parsing".

* Message-ID: <Pine.LNX.4.64.0604111725590.14565@g5.osdl.org>
  Colored diff? (Linus Torvalds)

  I am not opposed to it, but I'd like to do that internally if
  we go this route.  Needs to wait "option parsing".  Also
  Message-ID: <3536.10.10.10.24.1114117965.squirrel@linux1> is
  slightly related to this.

* Message-ID: <7vek02ynif.fsf@assigned-by-dhcp.cox.net>
  diff --with-raw, --with-stat? (me)

  I think "git diff" can be internalized next, after "option
  parsing" unification.  When that is done, --with-stat would
  help internalize format-patch's process_one(), and it would be
  trivial to do "git log --pretty=format-patch master..next".

* #irc 2006-04-10
  Shallow clones (Carl Worth).

  The experiment last round did not work out very well, but as
  existing repositories get bigger, and more projects being
  migrated from foreign SCM systems, this would become a
  must-have from would-be-nice-to-have.

  I am beginning to think using "graft" to cauterize history
  for this, while it technically would work, would not be so
  helpful to users, so the design needs to be worked out again.

* Message-ID: <E1FMH3o-0001B5-Dw@jdl.com>
  git status does not distinguish contents changes and mode
  changes; it just says "modified" (Jon Loeliger).

  Unconditionally changing the status letter would break
  Porcelains so we would need an extra option to do this.
  An outline patch has been already prepared -- this perhaps has
  to wait until we sort out the "option parsing" one.

* Message-ID: <tnxmzf9sh7k.fsf@arm.com>
  git could use diff3 instead of merge which is a wrapper around
  diff3. (Catalin Marinas)

  If having "diff3" is a lot more common than having "merge", I
  do not have problem with this; "merge" being a wrapper to
  "diff3", people who have been happy with the current code
  would certainly have "diff3" installed so changing to "diff3"
  would not break them.

* Message-ID: <81b0412b0603020649u99a2035i3b8adde8ddce9410@mail.gmail.com>
  Windows problems summary (Alex Riesen)

  A good list to keep in mind.

* Message-ID: <Pine.LNX.4.64.0604030730040.3781@g5.osdl.org>
  Huge packfiles (Linus Torvalds)

  Because I do not think asking users to break up packs to
  manageable and mmap()able size is too much to ask, I would not
  be advocating for updating the pack idx to 64-bit offset and
  mmap()ing parts of a packfile, at least too strongly.

  However, we currently lack tool support or recipe for users
  with such a repository to easily break up packs.

* Message-ID: <1143856098.3555.48.camel@dv>
  Per branch property, esp. where to merge from (Pavel Roskin)

  This involves user-level "world model" design, which is more
  Porcelainish than Plumbing, and as people know I do not do
  Porcelain well; interested parties need to come up with what
  they want and how they want to use it.


* Re: Solaris test t5500 race condition
From: Peter Eriksen @ 2006-04-14 11:53 UTC (permalink / raw)
  To: git
In-Reply-To: <7vhd4wvhyq.fsf@assigned-by-dhcp.cox.net>

On Thu, Apr 13, 2006 at 10:34:05PM -0700, Junio C Hamano wrote:
> "Peter Eriksen" <s022018@student.dtu.dk> writes:
> 
> >     Generating pack...
> >     Done counting 3 objects.
> >     Deltifying 3 objects.
> >       33% (1/3) done^M  66% (2/3) done^M 100% (3/3) done
> >     Total 3Unpacking , written 33 objects          <------------
> >      (delta 0), reused 0 (delta 0)
> >     11fa2f0cb58ed7f02dbd5ac75ed82a53fae62a7b refs/heads/A
> 
> Hmph.  Not good.  Before the writer managed to flush the report
> the reader has already decoded the header and reports the number
> of objects it is going to unpack.
...
> -- >8 --
> [PATCH] t5500: test fix

With the patch it doesn't complain anymore.  There are many other 
problems with the tests on Solaris though.

Peter


* Re: Recent unresolved issues
From: Petr Baudis @ 2006-04-14 16:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v64lcqz9j.fsf@assigned-by-dhcp.cox.net>

Dear diary, on Fri, Apr 14, 2006 at 11:31:36AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> Here is a list of topics in the recent git traffic that I feel
> inadequately addressed.  I've commented on some of them to give
> people a feel for what my priorities are.  Somebody might want
> to rehash the ones low on my priority list to conclusion with a
> concrete proposal if they cared about them enough.  The list is
> *not* ordered in any way.

Nice summary!

> * Message-ID: <tnxmzf9sh7k.fsf@arm.com>
>   git could use diff3 instead of merge which is a wrapper around
>   diff3. (Catalin Marinas)
> 
>   If having "diff3" is a lot more common than having "merge", I
>   do not have problem with this; "merge" being a wrapper to
>   "diff3", people who have been happy with the current code
>   would certainly have "diff3" installed so changing to "diff3"
>   would not break them.

I've decided to bite the bullet and made Cogito use diff3 instead of
merge as of now. Let's see if anybody complains...

> * Message-ID: <1143856098.3555.48.camel@dv>
>   Per branch property, esp. where to merge from (Pavel Roskin)
> 
>   This involves user-level "world model" design, which is more
>   Porcelainish than Plumbing, and as people know I do not do
>   Porcelain well; interested parties need to come up with what
>   they want and how they want to use it.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* Re: Default remote branch for local branch
From: Petr Baudis @ 2006-04-14 16:16 UTC (permalink / raw)
  To: Josef Weidendorfer; +Cc: Pavel Roskin, Junio C Hamano, git
In-Reply-To: <200604021817.30222.Josef.Weidendorfer@gmx.de>

Dear diary, on Sun, Apr 02, 2006 at 06:17:29PM CEST, I got a letter
where Josef Weidendorfer <Josef.Weidendorfer@gmx.de> said that...
> > I would write the config like this:
> > 
> > [branch-upstream]
> > master = linus
> > ata-irq-pio = irq-pio
> > ata-pata = pata-drivers
> 
> That is not working, as said above. But with above syntax extension,
> with s/=/for/ it would be fine.

I'm sorry but I'm slow and I don't see it - why wouldn't this work?
(Except that the key name is case insensitive, which isn't too big a
deal IMHO.)

I for one think that the 'for'-syntax is insane - it's unreadable (your
primary query is by far most likely to be "what's the upstream when on
branch X", not "what branches is this upstream for"), would convolute
the configuration file syntax unnecessarily and would possibly also
complicate the git-repo-config interface. Pavel's syntax is much nicer.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.


* git-stripspace breakage
From: Linus Torvalds @ 2006-04-14 16:40 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


Junio,
 the current git-stripspace leaves extra newlines at the end, causing ugly 
commit logs in "git log". I assume/suspect that it's the recent 
"incomplete line" handling (that I acked, bad me), but I didn't actually 
test.

Trivially tested thus:

	[torvalds@g5 git]$ git-stripspace <<EOF
	> a
	> 
	> EOF
	a
	
	[torvalds@g5 git]$ 

note the extra unnecessary newline..

		Linus
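
For comparison, a sketch of the trailing-blank-line behaviour the
report expects. It is an awk stand-in written for this illustration,
not git-stripspace's implementation, and it covers only trailing
blank lines; the function name is hypothetical.

```shell
# Hold back blank lines and emit them only if a non-blank line follows,
# so any run of blank lines at the end of the input is dropped.
strip_trailing_blank_lines () {
	awk '
		/^[ \t]*$/ { pending++; next }
		{
			while (pending > 0) { print ""; pending-- }
			print
		}
	'
}

printf 'a\n\n' | strip_trailing_blank_lines
```

Fed the same input as the transcript above, this prints the single
line "a" with no blank line after it.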


* Re: Solaris test t5500 race condition
From: Jason Riedy @ 2006-04-14 16:41 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060414115317.GA5191@bohr.gbar.dtu.dk>

And "Peter Eriksen" writes:
 - > -- >8 --
 - > [PATCH] t5500: test fix
 - 
 - With the patch it doesn't complain anymore.  There are many other 
 - problems with the tests on Solaris though.

I just ran next branch's tests on 5.8 with no problems.  Could 
you be a bit more specific?

Jason


* Re: Fix up diffcore-rename scoring
From: Geert Bosch @ 2006-04-14 17:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vmzer4vmm.fsf@assigned-by-dhcp.cox.net>


On Apr 11, 2006, at 18:04, Junio C Hamano wrote:
>> Here's a possible way to do that first cut. Basically,
>> compute a short (256-bit) fingerprint for each file, such
>> that the Hamming distance between two fingerprints is a measure
>> for their similarity. I'll include a draft write up below.
>
> Thanks for starting this.
>
> There are a few things I need to talk about the way "similarity"
> is _used_ in the current algorithms.
>
> Rename/copy detection outputs "similarity" but I suspect what
> the algorithm wants is slightly different from what humans think
> of "similarity".  It is somewhere between "similarity" and
> "commonness".  When you are grading a 130-page report a student
> submitted, you would want to notice that last 30 pages are
> almost verbatim copy from somebody else's report.  The student
> in question added 100-page original contents so maybe this is
> not too bad, but if the report were a 30-page one, and the
> entire 30 pages were borrowed from somebody else's 130-page
> report, you would _really_ want to notice.

There just isn't enough information in a 256-bit fingerprint
to be able to determine if two strings have a long common
substring. Also, when the input gets longer, like a few MB,
or when the input has little information content (compresses
very well), statistical bias will reduce reliability.

Still, I used the similarity test on large tar archives, such
as complete GCC releases, and it does give reasonable
similarity estimates. Non-related inputs rarely have scores
above 5.

potomac%../gsimm -rd026c470aab28a1086403768a428358f218bba049d47e7d49f8589c2c0baca0c *.tar
55746560 gcc-2.95.1.tar 123 3.1
55797760 gcc-2.95.2.tar 112 11.8
55787520 gcc-2.95.3.tar 112 11.8
87490560 gcc-3.0.1.tar 112 11.8
88156160 gcc-3.0.2.tar 78 38.6
86630400 gcc-3.0.tar 80 37.0
132495360 gcc-3.1.tar 0 100.0

I'm mostly interested in the data storage aspects of git,
looking bottom-up at the blobs stored and deriving information
from that. My similarity estimator allows one to look at thousands
of large checked in files and quickly identify similar files.
For example, in the above case, you'd find it makes sense
to store gcc-3.1.tar as a difference from gcc-3.0.tar.
Doing an actual diff between these two archives takes a few
seconds, while the fingerprints can be compared in microseconds.
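
Comparing two fingerprints comes down to counting differing bits. A
rough sketch in POSIX shell, assuming the fingerprints are available
as equal-length hex strings; the function name and interface are
invented here and are not gsimm's.

```shell
# hamming HEX1 HEX2: print the number of bits that differ between two
# equal-length hex strings.
hamming () {
	a=$1 b=$2 dist=0
	while [ -n "$a" ]
	do
		ca=${a%"${a#?}"}	# first hex digit of a
		cb=${b%"${b#?}"}	# first hex digit of b
		a=${a#?} b=${b#?}
		x=$(( 0x$ca ^ 0x$cb ))	# bits that differ in this digit
		while [ "$x" -gt 0 ]
		do
			dist=$(( dist + (x & 1) ))
			x=$(( x >> 1 ))
		done
	done
	echo "$dist"
}

hamming ff 00
hamming 0f 0e
```

The two calls print 8 and 1. How a distance is turned into the
percentage score in the listing above is not specified in this
message, so the sketch stops at the raw bit count.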

> While reorganizing a program, a nontrivial amount of text is
> often removed from an existing file and moved to a newly created
> file.  Right now, the way similarity score is calculated has a
> heuristical cap to reject two files whose sizes are very
> different, but to detect and show this kind of file split, the
> sizes of files should matter less.
The way to do this is to split a file at content-determined
breakpoints: check the last n bits of a cyclic checksum over
a sliding window, and break if they match a magic number.
This would split the file in blocks with expected size of 2^n.
Then you'd store a fingerprint per chunk.
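
The splitting step can be sketched with od and awk. Assumptions,
loudly: a plain sum over the last WINDOW bytes stands in for the
cyclic checksum, the magic number is arbitrarily "all n bits set",
and the function name is made up. A sum is a much weaker hash than a
real rolling checksum, so the chunk sizes here will not follow the
2^n expectation closely.

```shell
# chunk_sizes WINDOW BITS: read bytes on stdin, print one chunk length
# per line.  Breakpoints depend only on the bytes inside the window.
chunk_sizes () {
	od -An -v -tu1 |
	awk -v w="$1" -v n="$2" '
		BEGIN { mod = 2 ^ n; magic = mod - 1 }
		{
			for (i = 1; i <= NF; i++) {
				pos++; len++
				slot = pos % w
				# add the new byte, drop the one leaving the window
				sum += $i - win[slot]
				win[slot] = $i
				# break when the last n bits equal the magic number
				if (pos >= w && sum % mod == magic) {
					print len
					len = 0
				}
			}
		}
		END { if (len > 0) print len }
	'
}

printf '%s' 'hello world, hello git' | chunk_sizes 4 2
```

The chunk lengths always add up to the input size. Because each
breakpoint is decided by local content only, inserting text early in
a file tends to move nearby breakpoints without disturbing later
ones, which is the property that makes per-chunk fingerprints
comparable across versions.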
> [...]
> Another place we use "similarity" is to break a file that got
> modified too much.  This is done for two independent purposes.
This could be done directly using the given algorithm.

> [...] Usually rename/copy
> detection tries to find rename/copy into files that _disappear_
> from the result, but with the above sequence, B never
> disappears.  By looking at how dissimilar the preimage and
> postimage of B are, we tell the rename/copy detector that B,
> although it does not disappear, might have been renamed/copied
> from somewhere else.
This could also be cheaply determined by my similarity estimator.
Almost always, you'd have a high similarity score. When there is
a low score, you could verify with a more precise and expensive
algorithm to have a consistent decision on what is considered
a break.

There is a -v option that gives more verbose output, including
estimated and actual average distances from the origin for the
random walks. For random input they'll be very close, but for
input with a lot of repetition the actual average will be far
larger. The ratio can be used as a measure of reliability of
the fingerprint: ratios closer to 1 are better.
> Also we can make commonness matter even more in the similarity
> used to "break" a file than rename detector, because if we are
> going to break it, we will not have to worry about the issue of
> showing an annoying diff that removes 100 lines after copying a
> 130-line file.  This implies that the break algorithm needs to
> use two different kinds of similarity, one for breaking and then
> another for deciding how to show the broken pieces as a diff.
>
> Sorry if this write-up does not make much sense.  It ended up
> being a lot more incoherent than I hoped it to be.
Regular diff algorithms will always give the most precise result.
What my similarity estimator does is give a probability that
two files have a lot of common substrings. Say, you'd have a
git archive with 10,000 blobs of about 1 MB, and you'd want
to determine how to pack this. You clearly can't use diff
programs to solve this, but you can use the estimates.

> Anyway, sometime this week I'll find time to play with your code
> myself.
Thanks, I'm looking forward to your comments.

   -Geert


* Re: Default remote branch for local branch
From: Josef Weidendorfer @ 2006-04-14 18:26 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git
In-Reply-To: <20060414161627.GA27689@pasky.or.cz>

On Friday 14 April 2006 18:16, you wrote:
> Dear diary, on Sun, Apr 02, 2006 at 06:17:29PM CEST, I got a letter
> where Josef Weidendorfer <Josef.Weidendorfer@gmx.de> said that...
> > > I would write the config like this:
> > > 
> > > [branch-upstream]
> > > master = linus
> > > ata-irq-pio = irq-pio
> > > ata-pata = pata-drivers
> > 
> > That is not working, as said above. But with above syntax extension,
> > with s/=/for/ it would be fine.
> 
> I'm sorry but I'm slow and I don't see it - why wouldn't this work?
> (Except that the key name is case insensitive, which isn't too big a
> deal IMHO.)

Hmm...
* IMHO "keys are case insensitive" is enough to not qualify for branch
names: currently, branch names are case sensitive, and with above syntax you
effectively change this rule (you can not distinguish upstreams for "master"
vs. "MASTER").
* a dot currently seems to be allowed in branch names. For config keys, the
dot separates subkeys.
* I thought it is a convention for config keys to be alphanum only,
eg. "/" isn't allowed, too (which is mandatory for branch names).
Unfortunately, I found nothing about allowed chars for config keys in the
documentation.
 
> I for one think that the 'for'-syntax is insane - it's unreadable (your
> primary query is by far most likely to be "what's the upstream when on
> branch X", not "what branches is this upstream for"), would convolute
> the configuration file syntax unnecessarily and would possibly also
> complicate the git-repo-config interface.

As far as I remember, the "... for ..." syntax was suggested by Linus for the
proxy.command config a long time ago. The original proposal there was to
use an URL as key part (as far as I can remember).

That said,

> Pavel's syntax is much nicer. 

... I agree with you here.

My suggestion would be to allow an optional syntax in the config file which is mapped
by git-repo-config to the normalized "... for ..."-scheme.
Eg. it should not be mandatory to specify "for ..." after the value of a key.
So instead of

  branch.upstream = linus for master

you should be able to say

  [branch]
  upstream for master = linus


Josef


* [PATCH] cg-admin-rewritehist: Seed the commit map with the parents specified with -r.
From: Johannes Sixt @ 2006-04-14 18:54 UTC (permalink / raw)
  To: git

When the first commit is manufactured, its parents are looked up in the
commit map. However, without this patch the map is always empty at that time.
If the entire history is rewritten, this is no problem because the first
commit does not have any parents anyway. However, if -r is used to constrain
rewriting to only part of the history, this first commit is manufactured
incorrectly without parents because 'cat' fails.

Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>

---

 cg-admin-rewritehist |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

ec09427d1fb4097c15fd6df4f07049a536bb7d2c
diff --git a/cg-admin-rewritehist b/cg-admin-rewritehist
index 9c49d80..b72c641 100755
--- a/cg-admin-rewritehist
+++ b/cg-admin-rewritehist
@@ -138,6 +138,7 @@ _git_requires_root=1
 
 tempdir=.git-rewrite
 startrev=
+startrevparents=
 filter_env=
 filter_tree=
 filter_index=
@@ -149,6 +150,7 @@ while optparse; do
 		tempdir="$OPTARG"
 	elif optparse -r=; then
 		startrev="^$OPTARG $OPTARG $startrev"
+		startrevparents="$OPTARG $startrevparents"
 	elif optparse --env-filter=; then
 		filter_env="$OPTARG"
 	elif optparse --tree-filter=; then
@@ -182,6 +184,11 @@ ret=0
 
 
 mkdir ../map # map old->new commit ids for rewriting parents
+
+# seed with identity mappings for the parents where we start off
+for commit in $startrevparents; do
+	echo $commit > ../map/$commit
+done
 
 git-rev-list --topo-order HEAD $startrev | tac >../revs
 commits=$(cat ../revs | wc -l)
-- 
1.3.0.rc2
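
The failure mode and the fix can be reproduced outside the script. In
this stand-alone sketch the temporary directory, the commit id and
the map_lookup helper are all hypothetical; only the idea of one file
per old commit id, holding the new id, comes from the patch.

```shell
tmp=$(mktemp -d) && mkdir "$tmp/map"

# Look up the rewritten id of a parent commit, the way the rewrite
# loop reads the map.
map_lookup () {
	cat "$tmp/map/$1" 2>/dev/null
}

parent=1111111

# Without seeding, the boundary parent has no map entry and cat fails.
before=$(map_lookup "$parent") || before="(missing)"

# Seed the map with an identity entry, as the patch does for -r.
echo "$parent" >"$tmp/map/$parent"
after=$(map_lookup "$parent")

echo "before seeding: $before"
echo "after seeding:  $after"
rm -rf "$tmp"
```

Before seeding the lookup yields nothing, so the first rewritten
commit would lose its parent; after seeding the parent maps to
itself.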


* Re: git-stripspace breakage
From: Junio C Hamano @ 2006-04-14 19:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0604140936520.3701@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Junio,
>  the current git-stripspace leaves extra newlines at the end, causing ugly 
> commit logs in "git log". I assume/suspect that it's the recent 
> "incomplete line" handling (that I acked, bad me), but I didn't actually 
> test.

Bad me too indeed.  I noticed it last night after writing
"What's in" message.  Will fix shortly.

Thanks.


* Re: Recent unresolved issues
From: sean @ 2006-04-14 19:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v64lcqz9j.fsf@assigned-by-dhcp.cox.net>

On Fri, 14 Apr 2006 02:31:36 -0700
Junio C Hamano <junkio@cox.net> wrote:

> * Message-ID: <Pine.LNX.4.64.0604111725590.14565@g5.osdl.org>
>   Colored diff? (Linus Torvalds)
> 
>   I am not opposed to it, but I'd like to do that internally if
>   we go this route.  Needs to wait "option parsing".  Also
>   Message-ID: <3536.10.10.10.24.1114117965.squirrel@linux1> is
>   slightly related to this.

Moving it internal sounds like a good idea.  Would you be open to
including the GIT_DIFF_PAGER option now anyway?   It has utility
beyond just color diffs.

Sean

^ permalink raw reply

* Re: Recent unresolved issues
From: Petr Baudis @ 2006-04-14 19:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v64lcqz9j.fsf@assigned-by-dhcp.cox.net>

Dear diary, on Fri, Apr 14, 2006 at 11:31:36AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> * Message-ID: <Pine.LNX.4.64.0604111725590.14565@g5.osdl.org>
>   Colored diff? (Linus Torvalds)
> 
>   I am not opposed to it, but I'd like to do that internally if
>   we go this route.  Needs to wait "option parsing".  Also
>   Message-ID: <3536.10.10.10.24.1114117965.squirrel@linux1> is
>   slightly related to this.

It might be worthwhile to make Git and Cogito compatible if you offer
color customization. Cogito lets the user customize the colors through
the $CG_COLORS variable (see cg-diff(1)).

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply

* git-svn and Author files question
From: Seth Falcon @ 2006-04-14 20:34 UTC (permalink / raw)
  To: git

Hi all,

I've been using git to manually track changes to a project that uses
svn as its primary SCM.

git-svn looks like it can help me streamline my workflow, but I'm
getting stuck with the following:

    mkdir foo
    cd foo
    git-svn init $URL  <--- the svn URL
    git-svn fetch
    Author: dfcimm3 not defined in  file

:-(

Can someone point me to the file and the place that describes what I
should put in it?  There are many committers to the svn project.  I'm
hoping that I will not have to enumerate all of their names in some
file.

I'm using git version 1.3.0.rc1.g40e9, and BTW, enjoying it very much.

Thanks,

+ seth

^ permalink raw reply

* git log is a bit antisocial
From: Nicolas Pitre @ 2006-04-14 20:50 UTC (permalink / raw)
  To: git


$ git log -h
fatal: unrecognized argument: -h
$ git log --help
fatal: unrecognized argument: --help

Maybe the usage string could be printed in those cases?


Nicolas

^ permalink raw reply

* Re: git log is a bit antisocial
From: Junio C Hamano @ 2006-04-14 20:56 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604141647360.2215@localhost.localdomain>

Nicolas Pitre <nico@cam.org> writes:

> $  git log -h
> fatal: unrecognized argument: -h
> $ git log --help
> fatal: unrecognized argument: --help
>
> Maybe the usage string could be printed in those cases?

Perhaps.  Alternatively, "git help log", perhaps.

^ permalink raw reply

* Re: git log is a bit antisocial
From: Nicolas Pitre @ 2006-04-14 21:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vlku7q3k7.fsf@assigned-by-dhcp.cox.net>

On Fri, 14 Apr 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > $  git log -h
> > fatal: unrecognized argument: -h
> > $ git log --help
> > fatal: unrecognized argument: --help
> >
> > Maybe the usage string could be printed in those cases?
> 
> Perhaps.  Alternatively, "git help log", perhaps.

What about git-log then?


Nicolas

^ permalink raw reply

* Re: git log is a bit antisocial
From: Junio C Hamano @ 2006-04-14 21:28 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604141719290.2215@localhost.localdomain>

Nicolas Pitre <nico@cam.org> writes:

> On Fri, 14 Apr 2006, Junio C Hamano wrote:
>
>> Nicolas Pitre <nico@cam.org> writes:
>> 
>> > $  git log -h
>> > fatal: unrecognized argument: -h
>> > $ git log --help
>> > fatal: unrecognized argument: --help
>> >
>> > Maybe the usage string could be printed in those cases?
>> 
>> Perhaps.  Alternatively, "git help log", perhaps.
>
> What about git-log then?

What about it?

Asking for help on log could be spelled as "git log --help" with
a patch like the attached, but I am not sure that is worth it...

-- >8 --
diff --git a/git.c b/git.c
index 78ed403..7fdacdd 100644
--- a/git.c
+++ b/git.c
@@ -497,6 +497,16 @@ int main(int argc, const char **argv, ch
 	}
 	argv[0] = cmd;
 
+	/* It could be git blah --help or git boo -h, but be
+	 * careful; most commands have their own '-h' and '--help'.
+	 */
+	if (argc == 2 &&
+	    (!strcmp(argv[1], "-h") || !strcmp(argv[1], "--help"))) {
+		argv[0] = "help";
+		argv[1] = cmd;
+		exit(cmd_help(1, argv, envp));
+	}
+
 	/*
 	 * We search for git commands in the following order:
 	 *  - git_exec_path()
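
The rewriting rule in the patch can also be sketched on its own, in shell rather than C. This is an illustration only: rewrite_help is a hypothetical function, not part of git, and it prints the argument list that would be dispatched instead of running anything. As in the patch, only the exact two-word form is rewritten, so a command that is given further arguments still gets to interpret '-h' or '--help' itself.

```sh
#!/bin/sh
# Sketch only: rewrite_help is a hypothetical helper, not part of git.
# $1 is the subcommand, the remaining arguments are its options.
rewrite_help() {
	# Mirror the argc == 2 test: the command plus exactly one argument,
	# and that argument is -h or --help.
	if [ $# -eq 2 ] && { [ "$2" = "-h" ] || [ "$2" = "--help" ]; }; then
		echo "help $1"
	else
		# Anything else is passed through unchanged.
		echo "$*"
	fi
}

rewrite_help log --help      # rewritten to: help log
rewrite_help log -h          # rewritten to: help log
rewrite_help log --help -p   # left alone:   log --help -p
```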

^ permalink raw reply related

