Git development


* Re: best git practices, was Re: Git User's Survey 2007 unfinished summary continued
From: Andreas Ericsson @ 2007-10-22 12:44 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jakub Narebski, Steffen Prohaska, Federico Mena Quintero, git
In-Reply-To: <Pine.LNX.4.64.0710221156540.25221@racer.site>

Johannes Schindelin wrote:
> So once again, what operations involving git do people use regularly?
> 

diff
qgit
commit
fetch
rebase
merge
status
push
cherry-pick
grep
bisect
add
show-ref

If I were to suggest any improvements, it'd be to change the semantics 
of git-pull to always update the local branches set up to be merged with 
the remote tracking branches when they, prior to fetching, pointed to 
the same commit, such that when

$ git show-ref master
d4027a816dd0b416dc8c7b37e2c260e6905f11b6 refs/heads/master
d4027a816dd0b416dc8c7b37e2c260e6905f11b6 refs/remotes/origin/master

refs/heads/master gets set to refs/remotes/origin/master post-fetch.

This would save me from this command sequence, which I currently have to 
do for git.

git fetch
git checkout next
git merge spearce/next
git checkout master
git merge spearce/master
git checkout maint
git merge spearce/maint
git checkout pu
git reset --hard spearce/pu

<rinse and repeat for every tracked branch>

git could definitely help here. I want the local branches to be 
up-to-date with the remote ones, because I frequently run diffs against 
the various branches to see if anything that I should be aware of has 
changed, and just as frequently I forget to add that 'origin/' prefix, 
which means I *might* be looking at old code.

I usually do that on internal projects, where we have "master", "next", 
"testing", and "stable" branches for pretty much every repo. We have 54 
git repos. The typing adds up. This is also one of the most frequent 
causes of confusion for my (even) less git-savvy co-workers. The 
argument usually goes like this:
"Umm... Peter, why did you commit your fix on top of 7 weeks old code?"
"Oh? I did git-pull first, just as you said, so it should have been the 
latest, shouldn't it?"
"Well, what branch were you on when you pulled?"
"Err.. does that matter? I didn't have any local modifications on the 
branch when I pulled, so it should have just updated it."

What's happened prior to such an argument is usually this:
next or master is inevitably checked out. The user does git-pull to get 
up to date. They then change branch and get down to business with 
rebasing, merging and editing. When it's time to push, git tells them 
"not a strict subset. use git-pull!", and they do, and sometimes it 
fails, and I have a hard time explaining why since I really don't see a 
reason for *not* updating all "to-merge" branches when they point to the 
same commit as their tracking-branch before the pull.

Patch to follow (at some point), although it's likely to make git-pull a 
built-in since I have no idea how to maintain coupled lists in shell.
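
A rough shell sketch of the behaviour described above, not the promised patch. The name `fetch_and_sync` is hypothetical, and the sketch assumes a git recent enough to support `%(upstream)` in `git for-each-ref`. It records which local branches point at the same commit as their tracking branch, fetches, and then moves only those branches. The checked-out branch is skipped, since moving it would also require updating the working tree.

```shell
#!/bin/sh
# fetch_and_sync [fetch options]: hypothetical helper, not a git command.
fetch_and_sync () {
	current=$(git symbolic-ref -q HEAD) || current=

	# Before fetching, remember every local branch that is level
	# with its upstream: "<ref> <sha> <upstream>" per line.
	level=$(
		git for-each-ref \
			--format='%(refname) %(objectname) %(upstream)' refs/heads |
		while read -r ref sha upstream
		do
			test -n "$upstream" || continue
			test "$ref" != "$current" || continue
			if test "$(git rev-parse -q --verify "$upstream")" = "$sha"
			then
				echo "$ref $sha $upstream"
			fi
		done
	)

	git fetch "$@" || return 1

	# After fetching, move the remembered branches to their upstream.
	# The old value is passed so a branch changed meanwhile is left alone.
	printf '%s\n' "$level" |
	while read -r ref old upstream
	do
		test -n "$ref" || continue
		new=$(git rev-parse -q --verify "$upstream") || continue
		git update-ref -m "sync with $upstream" "$ref" "$new" "$old"
	done
}
```

Branches that carry local commits are never recorded in the first pass, so they stay untouched and still need an explicit merge or rebase.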

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git User's Survey 2007 unfinished summary continued
From: Jakub Narebski @ 2007-10-22 12:26 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Johannes Schindelin, Steffen Prohaska, Federico Mena Quintero,
	git
In-Reply-To: <471C586A.9030900@op5.se>

On 10/22/07, Andreas Ericsson <ae@op5.se> wrote:
[...]
>>>>> On 10/20/07, Steffen Prohaska <prohaska@zib.de> wrote:
>>>>>
>>>>>> Maybe we could group commands into more categories?

> Similarly, it might be helpful to have help topics the gdb way, like
> "git help patches". It's one of those things that people have come to
> expect from a software tool, so perhaps we should humor them? Given gits
> "every help topic is a man-page" idiom, this shouldn't require any real
> technical effort.
>
> Such topics should probably include
> merge/merges/merging - overview of various ways of putting two lines of
> development back together
> patch/patches - how to create, send and apply
> tags/branches/refs - what they are, why they're good, link to merging

Very good idea. It is definitely something that can be worked on.

By the way, what do you think about a "spying" version of git: a specially
marked release which gathers statistics on which porcelain commands are used
and how often, with a git-sendstats command added in this release?

-- 
Jakub Narebski


* Re: [PATCH] Add some fancy colors in the test library when terminal supports it.
From: Johannes Sixt @ 2007-10-22 12:18 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Shawn O. Pearce, git
In-Reply-To: <20071022121106.GA7151@artemis.corp>

Pierre Habouzit schrieb:
> On Mon, Oct 22, 2007 at 11:35:30AM +0000, Johannes Sixt wrote:
>> Pierre Habouzit schrieb:
>>> On Mon, Oct 22, 2007 at 08:53:36AM +0000, Johannes Sixt wrote:
>>>> Pierre Habouzit schrieb:
>>>>> +say_color () {
>>>>> +	[ "$nocolor" = 0 ] &&  [ "$1" != '-1' ] && tput setaf "$1"
>>>>> +	shift
>>>>> +	echo "* $*"
>>>>> +	tput op
        ^^^^^^^^
I am talking about this line.

>>>>> +}
>> I wanted to point out that if tput is not 
>> available, the second invocation will leave "tput: command not found" 
>> behind on stderr. Therefore, I proposed to make the definition of 
>> say_color() different depending on whether $color is set or not. Then you 
>> don't need to test for $color twice inside the function.
> 
>   Right we can do that. I'll try to rework the patch. and no it
> shouldn't leave tput: command not found as I 2>/dev/null and I think the
> shell doesn't print that in that case. At least my zsh doesn't.

There is no 2>/dev/null. Am I missing something?

-- Hannes


* Re: [PATCH] Add some fancy colors in the test library when terminal   supports it.
From: Pierre Habouzit @ 2007-10-22 12:11 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Shawn O. Pearce, git
In-Reply-To: <471C8B02.6080202@viscovery.net>

[-- Attachment #1: Type: text/plain, Size: 1388 bytes --]

On Mon, Oct 22, 2007 at 11:35:30AM +0000, Johannes Sixt wrote:
> Pierre Habouzit schrieb:
> >On Mon, Oct 22, 2007 at 08:53:36AM +0000, Johannes Sixt wrote:
> >>Pierre Habouzit schrieb:
> >>>+say_color () {
> >>>+	[ "$nocolor" = 0 ] &&  [ "$1" != '-1' ] && tput setaf "$1"
> >>>+	shift
> >>>+	echo "* $*"
> >>>+	tput op
> >>>+}
> >>What if tput is not available, like on Windows? How about this (at the 
> >>end of the file, so it can obey --no-color):
> >  I answered to it already in my first mail: if tput isn't available,
> >the command fails, and $? is non 0. and nocolor is set. Or color isn't
> >set to 't' for your proposal.
> 
> I was too terse, sorry. I wanted to point out that if tput is not 
> available, the second invocation will leave "tput: command not found" 
> behind on stderr. Therefore, I proposed to make the definition of 
> say_color() different depending on whether $color is set or not. Then you 
> don't need to test for $color twice inside the function.

  Right, we can do that. I'll try to rework the patch. And no, it
shouldn't leave "tput: command not found", as I redirect with 2>/dev/null
and I think the shell doesn't print that in that case. At least my zsh doesn't.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]


* Re: [BUG] git-mv submodule failure
From: Yin Ping @ 2007-10-22 11:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0710211102230.25221@racer.site>

On 10/21/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:

> No.  If you "git mv" a submodule, it makes absolutely no sense to leave
> .gitmodules as is.
>
I can modify .gitmodules manually, since "git-submodule mv" is not
available. Anyway, I just want a command to rename a submodule easily,
not "mv && git-rm && git-add".


-- 
franky


* Re: git filter-branch --subdirectory-filter error
From: Jan Wielemaker @ 2007-10-22 11:37 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0710221218150.25221@racer.site>

Dscho,

On Monday 22 October 2007 13:20, Johannes Schindelin wrote:
> Hi,
>
> On Mon, 22 Oct 2007, Jan Wielemaker wrote:
> > Finished a big re-shuffle of a big project, while other developers
> > continued. Worked really well. Thanks guys! But now I have two top
> > directories and I want to create two new repositories, each containing
> > one of these directories (because the one holds copyrighted data and we
> > want the other to become public software). So, I happily run
> >
> > 	$ git filter-branch --subdirectory-filter RDF HEAD
> >
> > Where RDF is an existing directory.  I get:
> >
> > Rewrite 95807fe01c39d3092e3ac3a98061711323154d77 (1/12)fatal: Not a valid
> > object name 95807fe01c39d3092e3ac3a98061711323154d77:RDF
> > Could not initialize the index
>
> I guess that 95807fe01 is the parent of a commit adding the RDF/
> directory.
>
> The subdirectory filter does not look kindly upon a history where some
> commits lack the subdirectory in question.  However, this should work:
>
> 	git filter-branch --subdirectory-filter RDF 95807fe01..HEAD

Thanks, but ... hmmm.

$ git filter-branch --subdirectory-filter RDF 95807fe01c39d3092e3ac3a98061711323154d77..HEAD
Rewrite 0a43c802dd60f53d48136a32526a4b2a5f0d43e5 (1/11)fatal: Not a valid object name 0a43c802dd60f53d48136a32526a4b2a5f0d43e5:RDF
Could not initialize the index

$ git show 0a43c802dd60f53d48136a32526a4b2a5f0d43e5
commit 0a43c802dd60f53d48136a32526a4b2a5f0d43e5
Merge: 49fa961... 95807fe...
Author: XXX
Date:   Thu Oct 18 17:45:26 2007 +0200

    Merge branch 'master' of hildebra@gollem.science.uva.nl:/home/eculture/eculture

Tried 0a43c802dd60f53d48136a32526a4b2a5f0d43e5..HEAD, just to get
another one :-( I guess this will go on a little while :-( Before I
start writing a script that repeats this procedure to find a place
where it does work, I'd like to share some history with you.

This started as a big project with a lot of history in CVS, including
moved (read deleted and re-created) files. This was moved to SVN and
from there immediately to GIT. In GIT lots of things have been renamed.
The RDF directory was created quite recently in the project, and things
from various subdirectories were moved there.

Is there something that might be worth a try or should we go the simple
way: keeping the old combined repo for later reference and create two
new ones from fresh files?

	Cheers --- Jan


* Re: [PATCH] Add some fancy colors in the test library when terminal supports it.
From: Johannes Sixt @ 2007-10-22 11:35 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Johannes Sixt, Shawn O. Pearce, git
In-Reply-To: <20071022112401.GE32763@artemis.corp>

Pierre Habouzit schrieb:
> On Mon, Oct 22, 2007 at 08:53:36AM +0000, Johannes Sixt wrote:
>> Pierre Habouzit schrieb:
>>> +say_color () {
>>> +	[ "$nocolor" = 0 ] &&  [ "$1" != '-1' ] && tput setaf "$1"
>>> +	shift
>>> +	echo "* $*"
>>> +	tput op
>>> +}
>> What if tput is not available, like on Windows? How about this (at the 
>> end of the file, so it can obey --no-color):
> 
>   I answered to it already in my first mail: if tput isn't available,
> the command fails, and $? is non 0. and nocolor is set. Or color isn't
> set to 't' for your proposal.

I was too terse, sorry. I wanted to point out that if tput is not available, 
the second invocation will leave "tput: command not found" behind on stderr. 
Therefore, I proposed to make the definition of say_color() different 
depending on whether $color is set or not. Then you don't need to test for 
$color twice inside the function.
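
A minimal sketch of that idea, assuming the detection sets `color=t` as in the alternative proposed earlier in the thread. The wrapper name `define_say_color` is hypothetical and only exists so the choice can be re-run after `--no-color` has been parsed. When colour is off, `tput` is never invoked, so nothing can end up on stderr.

```shell
#!/bin/sh
# Detect colour support once. Output and errors of tput are discarded
# here, so a missing tput simply leaves $color empty.
color=
test "x$TERM" != "xdumb" &&
	tput hpa 60 >/dev/null 2>&1 &&
	tput setaf 1 >/dev/null 2>&1 &&
	color=t

# Hypothetical wrapper: pick one of two say_color definitions.
define_say_color () {
	if test -n "$color"
	then
		say_color () {
			# $1 is the colour number, -1 means "no colour"
			test "$1" != "-1" && tput setaf "$1"
			shift
			echo "* $*"
			tput op
		}
	else
		say_color () {
			# colour disabled: drop the colour argument, never call tput
			shift
			echo "* $*"
		}
	fi
}
define_say_color
```

A caller would use it as `say_color 1 "some failure message"`, or `say_color -1 "plain message"` for uncoloured output.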

-- Hannes


* Re: Howto request: going home in the middle of something?
From: Johannes Schindelin @ 2007-10-22 11:32 UTC (permalink / raw)
  To: Jan Wielemaker; +Cc: Petr Baudis, git
In-Reply-To: <200710221044.24191.wielemak@science.uva.nl>

Hi,

On Mon, 22 Oct 2007, Jan Wielemaker wrote:

> Thanks for the replies.	 I think I can live with something like this
> 
> 	<work, in the middle of something>
> 	$ git checkout -b home
> 	$ git commit
> 	$ git checkout master
> 	<arriving at home>
> 	$ git jan@work:repo fetch home:home	(using ssh)

You probably meant "git fetch jan@work:repo home:home".

> 	$ git checkout home
> 	<continue editing>
> 	$ git commit --amend
> 	$ git checkout master
> 	$ git merge home
> 	$ git -d home
> 	$ git commit
> 	$ git push
> 	<arriving at work>
> 	$ git -d home
> 	$ git pull
> 
> Its still a bit many commands and you have to be aware what you are
> doing for quite a while, but it does provide one single clean commit
> message, doesn't change the shared repo until all is finished and allows
> to abandon all work without leaving traces.
> 
> Personally I'd be more happy with
> 
> 	<work, in the middle of something>
> 	$ git stash
> 	<arriving at home>
> 	$ git stash fetch jan@work{0}	(well, some sensible syntax)
> 	$ git stash apply
> 	<continue editing>
> 	$ git commit
> 	$ git push
> 	<arriving at work>
> 	$ git pull

Happily, that is already possible.  However, instead of

	git stash fetch jan@work{0}

you should say

	git fetch jan@work stash:stash

This will only fetch the last stash, but that is what you wanted anyway, 
right?

Ciao,
Dscho

P.S.: Since you top-posted, I just ignored the mail you quoted, assuming 
that it was not relevant to your mail.


* Re: [PATCH] Add some fancy colors in the test library when terminal  supports it.
From: Pierre Habouzit @ 2007-10-22 11:24 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Shawn O. Pearce, git
In-Reply-To: <471C6510.8010300@viscovery.net>

[-- Attachment #1: Type: text/plain, Size: 2340 bytes --]

On Mon, Oct 22, 2007 at 08:53:36AM +0000, Johannes Sixt wrote:
> Pierre Habouzit schrieb:
> >Signed-off-by: Pierre Habouzit <madcoder@debian.org>
> >---
> >Maybe this is just me, but I don't find the output of the test-suite
> >easy to watch while scrolling. This puts some colors in proper places.
> >  * end-test summaries are in green or red depending on the success of
> >    the tests.
> >  * errors are in red.
> >  * skipped tests and other things that tests `say` are in brown (now
> >    you can _see_ that your testsuite skips some tests on purpose, I
> >    only noticed recently that I missed part of the environment for
> >    proper testing).
> >I'm not 100% sure the test to see if terminal supports color is correct,
> >and people using emacs shell buffer or alike tools may have better ideas
> >on how to make it.
> >and yes, I know that it "depends" upon tput, but if tput isn't
> >available, the
> >    [ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
> >expression will fail, and color will be disabled.
> > t/test-lib.sh |   32 ++++++++++++++++++++++----------
> > 1 files changed, 22 insertions(+), 10 deletions(-)
> >diff --git a/t/test-lib.sh b/t/test-lib.sh
> >index cc1253c..c6521c0 100644
> >--- a/t/test-lib.sh
> >+++ b/t/test-lib.sh
> >@@ -59,14 +59,24 @@ esac
> > # '
> > # . ./test-lib.sh
> >+[ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
> >+nocolor=$?
> 
> test "x$TERM" != "xdumb" &&
> 	tput hpa 60 >/dev/null 2>&1 &&
> 	tput setaf 1 >/dev/null 2>&1 &&
> 	color=t
> 
> BTW, doesn't tput fail if stdout/stderr is not a terminal, like above?
> 
> >+
> >+say_color () {
> >+	[ "$nocolor" = 0 ] &&  [ "$1" != '-1' ] && tput setaf "$1"
> >+	shift
> >+	echo "* $*"
> >+	tput op
> >+}
> 
> What if tput is not available, like on Windows? How about this (at the 
> end of the file, so it can obey --no-color):

  I answered to it already in my first mail: if tput isn't available,
the command fails, and $? is non 0. and nocolor is set. Or color isn't
set to 't' for your proposal.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]


* Re: git filter-branch --subdirectory-filter error
From: Johannes Schindelin @ 2007-10-22 11:20 UTC (permalink / raw)
  To: Jan Wielemaker; +Cc: git
In-Reply-To: <200710221227.13279.wielemak@science.uva.nl>

Hi,

On Mon, 22 Oct 2007, Jan Wielemaker wrote:

> Finished a big re-shuffle of a big project, while other developers 
> continued. Worked really well. Thanks guys! But now I have two top 
> directories and I want to create two new repositories, each containing 
> one of these directories (because the one holds copyrighted data and we 
> want the other to become public software). So, I happily run
> 
> 	$ git filter-branch --subdirectory-filter RDF HEAD
> 
> Where RDF is an existing directory.  I get:
> 
> Rewrite 95807fe01c39d3092e3ac3a98061711323154d77 (1/12)fatal: Not a valid 
> object name 95807fe01c39d3092e3ac3a98061711323154d77:RDF
> Could not initialize the index

I guess that 95807fe01 is the parent of a commit adding the RDF/ 
directory.

The subdirectory filter does not look kindly upon a history where some 
commits lack the subdirectory in question.  However, this should work:

	git filter-branch --subdirectory-filter RDF 95807fe01..HEAD
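
To see which commits in a range actually lack the directory before picking a starting point, something like the following could help. It is only a diagnostic sketch (the function name `list_commits_without_dir` is hypothetical) and does not change how filter-branch itself behaves.

```shell
#!/bin/sh
# list_commits_without_dir DIR [REV...]: print every commit reachable
# from the given revisions (default: HEAD) whose tree has no DIR entry.
list_commits_without_dir () {
	dir=$1
	shift
	test $# -gt 0 || set -- HEAD

	git rev-list "$@" |
	while read -r commit
	do
		# "<commit>:<path>" only resolves if the path exists there
		git rev-parse -q --verify "$commit:$dir" >/dev/null 2>&1 ||
			echo "$commit"
	done
}
```

For example, `list_commits_without_dir RDF 95807fe01..HEAD` lists the commits in that range that would trip up the subdirectory filter.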

Hth,
Dscho


* Re: [PATCH] Let users override name of per-directory ignore file
From: Andreas Ericsson @ 2007-10-22 11:18 UTC (permalink / raw)
  To: Karl Hasselström; +Cc: git, spearce
In-Reply-To: <20071022105029.GB31862@diana.vm.bytemark.co.uk>

Karl Hasselström wrote:
> On 2007-10-15 14:09:32 +0200, Andreas Ericsson wrote:
> 
>> When collaborating with projects managed by some other scm, it often
>> makes sense to have git read that other scm's ignore-files. This
>> patch lets git do just that, if the user only tells it the name of
>> the per-directory ignore file by specifying the newly introduced git
>> config option 'core.ignorefile'.
> 
>> +	For example, setting core.ignorefile to .svnignore in
>> +	repos where one interacts with the upstream project repo
>> +	using gitlink:git-svn[1] will make a both SVN users and
>> +	your own repo ignore the same files.
> 
>> +   The name of the `.gitignore` file can be changed by setting
>> +   the configuration variable 'core.ignorefile'. This is useful
>> +   when using git for projects where upstream is using some other
>> +   SCM. For example, setting 'core.ignorefile' to `.cvsignore`
>> +   will make git ignore the same files CVS would.
> 
> I agree with what you're trying to do, but you're ignoring the fact
> that Subversion's ignore patterns (and possibly cvs's too -- I haven't
> checked) are not recursive, while the patterns in .gitignore are
> recursive per default. So using ignore patterns directly from
> Subversion ignores more files under git than the same patterns did
> under Subversion.
> 

Yes, I just got bitten by this. The top-level .cvsignore file ignores 
Makefile (since it's generated by ./configure), but Makefile also exists in 
several subdirectories where it's *not* generated. Adding !Makefile to all 
those places doesn't sit too well with some of the project maintainers, and 
cvs doesn't grok /Makefile to mean "toplevel Makefile" (nor should it, since 
it has no notion of recursive ignores).

> One possible way to solve that would be to optionally have
> non-recursive per-directory ignore files. I haven't looked at how this
> is implemented, though, so I don't know if it's a good suggestion or
> not.
> 

I'll have a look at it. Thanks for the review.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Andreas Ericsson @ 2007-10-22 11:13 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, spearce
In-Reply-To: <Pine.LNX.4.64.0710221119300.25221@racer.site>

Johannes Schindelin wrote:
> Hi,
> 
> On Mon, 22 Oct 2007, Andreas Ericsson wrote:
> 
>> Johannes Schindelin wrote:
>>
>>> On Sun, 21 Oct 2007, Andreas Ericsson wrote:
>>>
>>>> [PATCH 1/1] looks a bit silly, and automagically handling this in 
>>>> git-format-patch makes some scripting around it a lot more pleasant.
>>> I think you should not use "-n" if you do not want to have the 
>>> numbers.
>> This stems from creating scripts around it where I only want to see the 
>> numbers if there is more than a single patch. Currently I can't do that 
>> without running git-format-patch twice or re-implementing the revision 
>> parsing machinery to count revisions prior to passing arguments to 
>> format-patch.
> 
> Why not have something as simple as
> 
> 	numbered=
> 	test $(git rev-list $options | wc -l) -gt 1 && numbered=-n
> 

Because 23498~12 != 23498~12..HEAD to git rev-list, but it is to
git-format-patch, meaning I'll have to duplicate the logic in every 
script that's supposed to use it or risk introducing a third way of 
specifying a list of revisions.
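
To illustrate the duplication in question, here is a sketch of what each wrapper script would need. Both function names are hypothetical, and `normalize_range` only covers the simplest case of a single revision argument, so it is not a faithful copy of format-patch's own argument handling.

```shell
#!/bin/sh
# normalize_range REV: treat a lone revision the way the mail above says
# format-patch does, i.e. as "REV..HEAD"; leave explicit ranges alone.
normalize_range () {
	case "$1" in
	*..*) printf '%s\n' "$1" ;;
	*)    printf '%s\n' "$1..HEAD" ;;
	esac
}

# numbered_flag REV: print "-n" only if more than one patch would result.
numbered_flag () {
	count=$(( $(git rev-list "$(normalize_range "$1")" | wc -l) ))
	if test "$count" -gt 1
	then
		printf '%s\n' -n
	fi
}
```

A script would then run `git format-patch $(numbered_flag "$rev") "$rev"`, which is exactly the extra logic that a built-in option would make unnecessary.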

> 	[...]
> 
> 	git format-patch $numbered $options
> 
> At the moment, the semantics of "--numbered" is really clear and precise.  
> And I really like that.  It makes for less surprises.
> 

Semantics could be equally clear for --numbered-if-multiple.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* best git practices, was Re: Git User's Survey 2007 unfinished summary continued
From: Johannes Schindelin @ 2007-10-22 11:04 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Jakub Narebski, Steffen Prohaska, Federico Mena Quintero, git
In-Reply-To: <471C586A.9030900@op5.se>

Hi,

On Mon, 22 Oct 2007, Andreas Ericsson wrote:

> Johannes Schindelin wrote:
> 
> > I'd really like people to respond not so much with broad and general 
> > statements to my mail (those statements tend to be rather useless to 
> > find how to make git more suitable to newbies), but rather with 
> > concrete top ten lists of what they do daily.
> > 
> > My top ten list:
> > 
> > - git diff
> > - git commit
> > - git status
> > - git fetch
> > - git rebase
> > - git pull
> > - git cherry-pick
> > - git bisect
> > - git push
> > - git add
> > 
> > So again, I'd like people who did _not_ tweak git to their likings to 
> > tell the most common steps they do.  My hope is that we see things 
> > that are good practices, but could use an easier user interface.
> 
> I'm not so sure we'd want to hide commands that git-gurus simply do not 
> use, such as git-blame.

I was not talking about commands that git gurus simply do not use.  I 
explicitly avoided asking "git gurus" for what they use.

> In my opinion, we should just locate the highest level available of UI 
> tool that implements a particular feature and have that listed in the 
> git[- ]<tab> view.

From the survey it is utterly clear that the available UI tools are still 
not good enough.

So once again, what operations involving git do people use regularly?

<rationale>There is a good chance that git is not optimised for most 
people's daily workflows, as project maintainers seemed to be much more 
forthcoming with patches, and therefore maintainers' tasks are much more 
optimised than in other SCMs.</rationale>

Ciao,
Dscho

P.S.: If nobody replies with actual daily workflows to this mail, I'll 
just assume that this complaint in the user survey was just bollocks, and 
no change in git is needed.


* Re: [PATCH] Be nice with compilers that do not support runtime paths at all.
From: Johannes Schindelin @ 2007-10-22 10:52 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Benoit SIGOURE, git list
In-Reply-To: <20071022064454.GV14735@spearce.org>

Hi,

On Mon, 22 Oct 2007, Shawn O. Pearce wrote:

> Benoit SIGOURE <tsuna@lrde.epita.fr> wrote:
> > >On Oct 4, 2007, at 1:18 AM, Junio C Hamano wrote:
> > >>Benoit Sigoure <tsuna@lrde.epita.fr> writes:
> > >>
> > >>If we do not care about supporting too old GNU make, we can do
> > >>this by first adding this near the top:
> > >>
> > >>        ifndef NO_RPATH
> > >>        LINKER_PATH = -L$(1) $(CC_LD_DYNPATH)$(1)
> > >>        else
> > >>        LINKER_PATH = -L$(1)
> > >>        endif
> > >>
> > >>and then doing something like:
> > >>
> > >>	CURL_LIBCURL = $(call LINKER_PATH,$(CURLDIR)/$(lib))
> > >>	OPENSSL_LINK = $(call LINKER_PATH,$(OPENSSLDIR)/$(lib))
> > >>
> > >>to make it easier to read and less error prone.
> > >
> > >Yes.  I can rework the patch, but the question is: do you care  
> > >about old GNU make?  Can I rewrite the patch with this feature?
> > 
> > I know Junio is still offline but maybe someone else has an objection 
> > against this?
> 
> How old of a GNU make are talking about here?  The above is certainly a 
> lot nicer to read, but I'd hate to suddenly ship a new Git that someone 
> cannot compile because their GNU make is too old.

I seem to remember that we had some shell quoting in the 
Makefile, and it was "call"ed.  That broke some setups, so we got rid of 
it.

*starting "git log -Scall Makefile"*: yep.  It even was me fixing it, in 
39c015c556f285106931e0500f301de462b0e46e.

Ciao,
Dscho


* Re: [PATCH] Let users override name of per-directory ignore file
From: Karl Hasselström @ 2007-10-22 10:50 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git, spearce
In-Reply-To: <20071021144542.8855A5BB85@nox.op5.se>

On 2007-10-15 14:09:32 +0200, Andreas Ericsson wrote:

> When collaborating with projects managed by some other scm, it often
> makes sense to have git read that other scm's ignore-files. This
> patch lets git do just that, if the user only tells it the name of
> the per-directory ignore file by specifying the newly introduced git
> config option 'core.ignorefile'.

> +	For example, setting core.ignorefile to .svnignore in
> +	repos where one interacts with the upstream project repo
> +	using gitlink:git-svn[1] will make a both SVN users and
> +	your own repo ignore the same files.

> +   The name of the `.gitignore` file can be changed by setting
> +   the configuration variable 'core.ignorefile'. This is useful
> +   when using git for projects where upstream is using some other
> +   SCM. For example, setting 'core.ignorefile' to `.cvsignore`
> +   will make git ignore the same files CVS would.

I agree with what you're trying to do, but you're ignoring the fact
that Subversion's ignore patterns (and possibly cvs's too -- I haven't
checked) are not recursive, while the patterns in .gitignore are
recursive per default. So using ignore patterns directly from
Subversion ignores more files under git than the same patterns did
under Subversion.

One possible way to solve that would be to optionally have
non-recursive per-directory ignore files. I haven't looked at how this
is implemented, though, so I don't know if it's a good suggestion or
not.

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle


* Re: [PATCH] execv_git_cmd(): also try PATH if everything else fails.
From: Johannes Schindelin @ 2007-10-22 10:35 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Parish, git
In-Reply-To: <20071022042110.GJ14735@spearce.org>

Hi,

On Mon, 22 Oct 2007, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Earlier, we tried to find the git commands in several possible exec
> > dirs.  Now, if all of these failed, try to find the git command in
> > PATH.
> ...
> > diff --git a/exec_cmd.c b/exec_cmd.c
> > index 9b74ed2..70b84b0 100644
> > --- a/exec_cmd.c
> > +++ b/exec_cmd.c
> > @@ -36,7 +36,8 @@ int execv_git_cmd(const char **argv)
> >  	int i;
> >  	const char *paths[] = { current_exec_path,
> >  				getenv(EXEC_PATH_ENVIRONMENT),
> > -				builtin_exec_path };
> > +				builtin_exec_path,
> > +				"" };
> 
> So if the user sets GIT_EXEC_PATH="" and exports it we'll search $PATH 
> before the builtin exec path that Git was compiled with? Are we sure we 
> want to do that?

I thought the proper way to unset EXEC_PATH was to "unset GIT_EXEC_PATH".  
In that case, getenv(EXEC_PATH_ENVIRONMENT) returns NULL and we're fine, 
no?
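
The two cases under discussion differ as follows: for an unset variable getenv() returns NULL, while for a variable exported as the empty string it returns a pointer to "". A small shell sketch of the same distinction (the function name `exec_path_state` is hypothetical):

```shell
#!/bin/sh
# exec_path_state: report whether GIT_EXEC_PATH is unset, empty, or set.
exec_path_state () {
	if test "${GIT_EXEC_PATH+set}" != set
	then
		# getenv("GIT_EXEC_PATH") would return NULL in a child process
		echo unset
	elif test -z "$GIT_EXEC_PATH"
	then
		# getenv() would return "", a valid but empty path entry
		echo empty
	else
		echo "set: $GIT_EXEC_PATH"
	fi
}
```

Only the "empty" case is the one where an empty path entry would be tried ahead of the built-in exec path.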

Ciao,
Dscho


* Re: Linear time/space rename logic for *inexact* case
From: Jeff King @ 2007-10-22 10:34 UTC (permalink / raw)
  To: Andy C; +Cc: git
In-Reply-To: <596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com>

On Mon, Oct 22, 2007 at 02:40:08AM -0700, Andy C wrote:

> I just subscribed to this list, so sorry I can't respond to the
> threads already started here.  I'm the guy that was mailing Linus
> about this algorithm to do similarity detection in linear time,
> mentioned here:

Great, welcome aboard. :)

> To avoid the m*n memory usage of step 2, I use a hash table which maps
> 2-tuples to counts to represent the sparse similarity matrix, instead
> of representing it directly.  The 2-tuple is a pair of filenames,
> corresponding to the row/column of the matrix, and the counts/values
> are the entries in the matrix.

OK, makes sense (that's what I was trying to propose near the end of my
mail).

> You can also prune entries which have a long "postings list" (using
> the term in the IR sense here:
> http://www.xapian.org/docs/intro_ir.html).  This has the nice side
> effect of getting rid of quadratic behavior, *and* making the
> algorithm more accurate because it stops considering common lines like
> "#endif" as contributing to similarity.

Ah, very clever. I think that should have nice behavior most of the
time, though I wonder about a pathological case where we have many
files, all very similar to each other, and then have a commit where they
all start to diverge, but just by a small amount, while getting moved.
We would end up with an artificially low score between renamed files
(because we've thrown out all of the commonality) which may lead us to
believe there was no rename at all.

But it might be worth ignoring that case.

> This pruning of common lines from the index makes step 3 linear.  In
> fact, I prune the index to only include lines which occur just *once*.
>  Nearly every line of code in real data sets is unique, so this works
> well in practice.

Makes sense.

> http://marc.info/?l=git&m=118975452330644&w=2
> "Of course, 2 minutes for a git-status is still way too slow, so there we
> might still need a limiter. It also looks like 57% of my time is spent
> in spanhash_find, and another 29% in diffcore_count_changes."
> 
> I know there have been a couple fixes since you posted that, but if it
> was the O(m*n) behavior that was causing your problem, it should still
> be reproducible.  Linus suggested that this is a good data set to try
> things out with.  Is there there a repository I can clone with git
> commands to run to repro this?

Yes, I still have the problem (the 2 minutes was _after_ we did fixes,
down from 20 minutes; the latest patch from Linus just covers the "exact
match" case where two files have identical SHA1s).

It's a 1.2G repo at the end of a slow DSL line, so rather than cloning
it, here's a way to reproduce a repo with similar properties:

-- >8 --
#!/bin/sh
rm -rf repo
mkdir repo && cd repo

# seq and openssl aren't portable, but the
# point is to generate 600 random 100K files
for i in `seq -f %03g 1 600`; do
  openssl rand 100000 >$i.rand
done

# make repo, fully packed
# we don't bother trying to delta in the pack
# since the files are all random
git-init
git-add .
git-commit -q -m one
git-repack -a -d --window=0

# move every file
mkdir new
git-mv *.rand new

# modify every file
for i in new/*.rand; do
  echo foo >>$i
done
git-add new

# this is the operation of interest
time git-diff --cached --raw -M -l0 >/dev/null

-- >8 --

The idea is to have a large number of files that are slightly changed
and moved, and to try to find the pairs.  The diff takes about 20
seconds to run for me (the files in the real repo are about 1M each
rather than 100K, but it's nice to have the tests take a lot less time). If you
want a bigger test, bump up the file size (or increase the number of
files, which will emphasize the quadratic behavior).

> 3) Compute the similarity metric, which I've defined here as
> max(c/|left file|, c/|right file|), where c is the entry in the matrix
> for the file pair.  Note that the files are treated as *sets* of lines
> (unordered, unique).  The similarity score is a number between 0.0 and
> 1.0.  Other similarity metrics are certainly possible.

We have to handle binary files, too. In the current implementation, we
consider either lines or "chunks", and similarity is increased by the
size of the chunk.
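
To make the lines-or-chunks idea concrete, here is a minimal Python 3
sketch. It is not git's C code: the 64-byte cap, the function names and
the choice of denominator are assumptions made for illustration. A chunk
ends at a newline or once it reaches the cap, so binary data without
newlines still gets split, and each shared chunk is credited by its size
in bytes.

```python
from collections import Counter

MAX_CHUNK = 64  # assumed cap on chunk length; illustrative only


def split_chunks(data: bytes, max_chunk: int = MAX_CHUNK):
    """Split data into chunks ending at a newline or after max_chunk bytes."""
    chunks = []
    start = 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == max_chunk:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_bytes(src: bytes, dst: bytes) -> int:
    """Total size of the chunks common to both inputs (multiset intersection)."""
    a = Counter(split_chunks(src))
    b = Counter(split_chunks(dst))
    return sum(len(chunk) * min(n, b[chunk]) for chunk, n in a.items())


def similarity(src: bytes, dst: bytes) -> float:
    """Shared bytes divided by the size of the larger input (an assumption)."""
    biggest = max(len(src), len(dst))
    return shared_bytes(src, dst) / biggest if biggest else 1.0
```

For text input this degenerates to a per-line comparison weighted by
line length; for random binary input almost no chunks match unless the
data really was copied.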

> * Some people might be concerned that it treats files as unordered
> sets of lines.  The first thought might be to do this as a
> preprocessing step to cull the list of candidates, and then do a real
> delta.  But in my experience, I haven't encountered a case where
> there's all that much to gain by doing that.

I think we are already treating the files as unordered sets of lines.
And really, I think there is some value in that, anyway. If I reverse
the order of all lines in a file, it might be useful for git to say
"this file came from that file".

> * This can be run on all files, not just adds/deletes.  If I have a

Great. git has a "look for copies also" flag, but it is usually disabled
because of the computational cost. If we can get it low enough, it might
actually become a lot more useful.

> If anything about the explanation is unclear, let me know and I will
> try to clarify it.  Playing around with the demo should illuminate
> what it does.  You can run it on data sets of your own.  All you need
> is 2 source trees and the "find" command to produce input to the
> script (see setup_demo.sh).

I'll try it on my test data, but it sounds like it doesn't really handle
binary files.

> As mentioned, I will try to do some tests on this, perhaps with Jeff's
> hard data set, and show that the results are good and that the
> algorithm is faster because the quadratic behavior is gone (if Linus
> doesn't beat me to it!).

Trying to fit it into the C git code would be useful, but I doubt I'll
have time to work on it tonight, since it's getting onto dawn here.

-Peff

* git filter-branch --subdirectory-filter error
From: Jan Wielemaker @ 2007-10-22 10:27 UTC (permalink / raw)
  To: git

Hi,

Finished a big re-shuffle of a big project, while other developers
continued. Worked really well. Thanks guys! But now I have two top
directories and I want to create two new repositories, each containing
one of these directories (because the one holds copyrighted data and we
want the other to become public software). So, I happily run

	$ git filter-branch --subdirectory-filter RDF HEAD

Where RDF is an existing directory.  I get:

Rewrite 95807fe01c39d3092e3ac3a98061711323154d77 (1/12)fatal: Not a valid 
object name 95807fe01c39d3092e3ac3a98061711323154d77:RDF
Could not initialize the index

I tried the procedure on some smaller test projects and it all worked
just fine. Running git version 1.5.3.4 on SuSE Linux. Also ran "git fsck
--full", which completed without any message.

Git show says:

gollem (eculture) 121_> git show 95807fe01c39d3092e3ac3a98061711323154d77 | 
cat
commit 95807fe01c39d3092e3ac3a98061711323154d77
Merge: 76d2935... 58afb98...
Author: Jan Wielemaker <wielemak@science.uva.nl>
Date:   Thu Oct 18 17:32:22 2007 +0200

    Merge branch 'master' of /home/eculture/eculture

Any clue?  

	Thanks --- Jan

* Re: .gittattributes handling has deficiencies
From: Johannes Schindelin @ 2007-10-22 10:29 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Shawn O. Pearce, david, git
In-Reply-To: <565E1D52-59C4-4EB8-AA81-FF94F346FE61@zib.de>

Hi,

On Mon, 22 Oct 2007, Steffen Prohaska wrote:

> .gitattributes is first looked for in the working directory, and if not 
> there, .gitattributes is read from the index.

Of course we could change that to do it the other way round.  But this 
would contradict expectations when you edit .gitattributes and then 
checkout single files without having git-add'ed .gitattributes first.

The biggest problem in your setup, however, is not whether .gitattributes is 
read from the index or the working directory.  The biggest problem is that 
files are not touched when their contents have not changed.

IOW if you have .gitattributes in the to-be-checked-out branch which says 
that README is crlf, and in the current branch it is not, and README's 
_contents_ are identical in both branches, a "git checkout 
<that-other-branch>" will not rewrite README, and consequently not change 
the working copy to crlf.

Ciao,
Dscho

* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Johannes Schindelin @ 2007-10-22 10:22 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git, spearce
In-Reply-To: <471C77F0.5050701@op5.se>

Hi,

On Mon, 22 Oct 2007, Andreas Ericsson wrote:

> Johannes Schindelin wrote:
> 
> > On Sun, 21 Oct 2007, Andreas Ericsson wrote:
> > 
> > > [PATCH 1/1] looks a bit silly, and automagically handling this in 
> > > git-format-patch makes some scripting around it a lot more pleasant.
> > 
> > I think you should not use "-n" if you do not want to have the 
> > numbers.
> 
> This stems from creating scripts around it where I only want to see the 
> numbers if there is more than a single patch. Currently I can't do that 
> without running git-format-patch twice or re-implementing the revision 
> parsing machinery to count revisions prior to passing arguments to 
> format-patch.

Why not have something as simple as

	numbered=
	test $(git rev-list $options | wc -l) -gt 1 && numbered=-n

	[...]

	git format-patch $numbered $options

At the moment, the semantics of "--numbered" are really clear and precise.  
And I really like that.  It makes for fewer surprises.

Ciao,
Dscho

* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Andreas Ericsson @ 2007-10-22 10:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, spearce
In-Reply-To: <Pine.LNX.4.64.0710221044080.25221@racer.site>

Johannes Schindelin wrote:
> Hi,
> 
> On Sun, 21 Oct 2007, Andreas Ericsson wrote:
> 
>> [PATCH 1/1] looks a bit silly, and automagically handling this in 
>> git-format-patch makes some scripting around it a lot more pleasant.
> 
> I think you should not use "-n" if you do not want to have the numbers.  

This stems from creating scripts around it where I only want to see the 
numbers if there is more than a single patch. Currently I can't do that 
without running git-format-patch twice or re-implementing the revision 
parsing machinery to count revisions prior to passing arguments to 
format-patch.

> In circumstances such as yours, where you can have patch series larger than 
> one, I imagine that the "[PATCH 1/1]" carries important information, 
> which is not present in "[PATCH]": this patch series contains only one 
> patch.
> 

That's sort of implicit in [PATCH], since all patch-series added by the 
tool I'm scripting will have [PATCH n/m], while I'd prefer for 
single-patches to have only [PATCH].

> IOW I do not like your patch: too much DWIDNS (Do What I Did NOT Say) for 
> me.
> 

Would a separate option be acceptable to you?

It doesn't have to be fancy or short, since I really only mean to use it 
from our scripts.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

* Re: Linear time/space rename logic for *inexact* case
From: Andy C @ 2007-10-22 10:09 UTC (permalink / raw)
  To: git
In-Reply-To: <596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com>

On 10/22/07, Andy C <andychup@gmail.com> wrote:
> So the algorithm is:

I think I can make this a lot clearer than I did, while glossing over
some details and the line_threshold parameter.

1) Make a "left index" and a "right index" out of the 2 sets of files,
{ line => [list of docs] }.

2) Remove any lines that appear in more than one doc from the left
index.  Do the same for the right index.  (this corresponds to
line_threshold=1 case)

3) For all lines, if the line appears in *both* the left index and the
right index, increment the count of the (row=doc from left set,
column=doc from right set) entry in the similarity matrix by 1.  The
matrix is represented by a hash of 2-tuples => counts.

After this is done for all lines, then the matrix is sparsely filled
with the count of common lines between every pair of files in the 2
sets.  The vast majority of cells in the matrix are implicitly 0 and
thus consume neither memory nor CPU with the hash table representation
of matrix.

4) Then you can use this to compute similarity scores.

Hopefully that is more clear... though I guess it might not be obvious
that it works for the problem that git has.  I am fairly sure it does,
but the demo should allow us to evaluate that.
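
Steps 1 to 3 above can be sketched in a few lines of Python 3. This is
an illustrative restatement for the line_threshold=1 case, not the
attached script: the function names and the tiny in-memory "files" are
made up for the example, and step 4 (scoring) is left out.

```python
from collections import defaultdict


def build_index(files):
    """Step 1: map each line to the set of docs it appears in.

    `files` is a dict of {doc name: list of lines}.
    """
    index = defaultdict(set)
    for name, lines in files.items():
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                index[line].add(name)
    return index


def prune(index):
    """Step 2 (line_threshold=1): keep lines that occur in exactly one doc."""
    return {line: next(iter(docs)) for line, docs in index.items()
            if len(docs) == 1}


def common_line_matrix(left, right):
    """Step 3: sparse matrix {(left doc, right doc): number of shared lines}."""
    left_index = prune(build_index(left))
    right_index = prune(build_index(right))
    matrix = defaultdict(int)
    for line, left_doc in left_index.items():
        right_doc = right_index.get(line)
        if right_doc is not None:
            matrix[(left_doc, right_doc)] += 1
    return dict(matrix)


left = {"a.c": ["int x;", "x = 1;", "#endif"],
        "b.c": ["int y;", "#endif"]}
right = {"new/a.c": ["int x;", "x = 1;", "x = 2;"],
         "new/b.c": ["int y;", "char z;"]}
matrix = common_line_matrix(left, right)
# matrix == {("a.c", "new/a.c"): 2, ("b.c", "new/b.c"): 1}
```

"#endif" appears in both left files, so it is pruned and never touches
the matrix; only the two pairs that share unique lines get a cell.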

Andy

* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Johannes Schindelin @ 2007-10-22  9:44 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git, spearce
In-Reply-To: <20071021144637.762085BB92@nox.op5.se>

Hi,

On Sun, 21 Oct 2007, Andreas Ericsson wrote:

> [PATCH 1/1] looks a bit silly, and automagically handling this in 
> git-format-patch makes some scripting around it a lot more pleasant.

I think you should not use "-n" if you do not want to have the numbers.  
In circumstances such as yours, where you can have patch series larger than 
one, I imagine that the "[PATCH 1/1]" carries important information, 
which is not present in "[PATCH]": this patch series contains only one 
patch.

IOW I do not like your patch: too much DWIDNS (Do What I Did NOT Say) for 
me.

Ciao,
Dscho

* Linear time/space rename logic for *inexact* case
From: Andy C @ 2007-10-22  9:40 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 9134 bytes --]

I just subscribed to this list, so sorry I can't respond to the
threads already started here.  I'm the guy that was mailing Linus
about this algorithm to do similarity detection in linear time,
mentioned here:

Subject:    [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2

Subject:    [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2

Jeff, in this message http://marc.info/?l=git&m=119303566130201&w=2 I
think you basically hit on the first half of what I was getting at.

The step 1 you describe is the first step of the algorithm -- make an
inverted index of lines (and you can use a hash code of the line to
stand in for the line).

To avoid the m*n memory usage of step 2, I use a hash table which maps
2-tuples to counts to represent the sparse similarity matrix, instead
of representing it directly.  The 2-tuple is a pair of filenames,
corresponding to the row/column of the matrix, and the counts/values
are the entries in the matrix.

You can also prune entries which have a long "postings list" (using
the term in the IR sense here:
http://www.xapian.org/docs/intro_ir.html).  This has the nice side
effect of getting rid of quadratic behavior, *and* making the
algorithm more accurate because it stops considering common lines like
"#endif" as contributing to similarity.

This pruning of common lines from the index makes step 3 linear.  In
fact, I prune the index to only include lines which occur just *once*.
 Nearly every line of code in real data sets is unique, so this works
well in practice.
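
As a small illustration of the pruning (a Python 3 sketch with a
hand-written index; the helper name and the sample lines are invented
for the example), dropping every line whose postings list is longer
than the threshold leaves only the lines that can discriminate between
files:

```python
def prune_postings(index, line_threshold=1):
    """Drop lines whose postings list is longer than line_threshold."""
    return {line: files for line, files in index.items()
            if len(files) <= line_threshold}


index = {
    "#endif": ["a.c", "b.c", "c.c"],
    "} else {": ["a.c", "c.c"],
    "static int frob(void)": ["b.c"],
}
once_only = prune_postings(index, line_threshold=1)
at_most_two = prune_postings(index, line_threshold=2)
```

With the threshold at 1, each surviving line names exactly one file, so
combining two such indices does at most one matrix increment per line.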

I already sent this demo to Linus, but I think it's worth sending to
the list as well.  I am just going to copy parts of my earlier e-mails
below, and attach the same demos (hopefully it is kosher to send
attachments to this list).

Before I do that, Jeff, can you still reproduce this problem:

http://marc.info/?l=git&m=118975452330644&w=2
"Of course, 2 minutes for a git-status is still way too slow, so there we
might still need a limiter. It also looks like 57% of my time is spent
in spanhash_find, and another 29% in diffcore_count_changes."

I know there have been a couple fixes since you posted that, but if it
was the O(m*n) behavior that was causing your problem, it should still
be reproducible.  Linus suggested that this is a good data set to try
things out with.  Is there a repository I can clone with git
commands to run to repro this?

OK, so attached is a little demo of the algorithm, which is (very
little) Python code, but with comments so non-Python people can
hopefully follow it.  Because of this the timings are not very
meaningful, but it proves that the algorithm doesn't blow up.

I ran it on the entire Linux 2.4 vs. Linux 2.6 codebases.  It is
*only* considering file content.  You can rename every file in both
source trees to completely random strings and it will still match
files up.  There is nothing about filenames, or identical files, and
you can consider the whole 2.4 side "all deletes" and the whole 2.6
side "all adds".  The size of the matrix would be around 286 million
cells, but here I only represent the non-zero entries in the matrix,
which is only 15,406 cells.

$ wc -l similarity_demo.py
233 similarity_demo.py

$ ./similarity_demo.py in-all-*
12697 * 22530 = 286063410 total possible pairs of files
Indexing the first set of 12697 files (threshold=1)
Indexing the second set of 22530 files (threshold=1)
Sum of file sizes in first set: 2134249 lines
Sum of file sizes in second set: 3424338 lines
Size of index over first set: 2134249
Size of index over second set: 3424338
Computing union of lines in the indices
Total unique lines in both indices: 4384375
Making sparse common line matrix
Calculating similarity for 15406 pairs of files
Sorting 15406 similar pairs
Writing similarity report
Wrote out-in-all-24-vs-in-all-26.1.txt
End
------
Report
------
Indexing the first set of 12697 files (threshold=1)          29.540s
Indexing the second set of 22530 files (threshold=1)         111.041s
Computing union of lines in the indices                      7.450s
Making sparse common line matrix                             13.468s
Calculating similarity for 15406 pairs of files              0.055s
Sorting 15406 similar pairs                                  0.030s
Writing similarity report                                    0.249s
Total time                                                   161.834s

The script outputs a text file with 15,406 similar pairs, in order of
similarity (1.0's are at the top):

andychu demo$ wc -l out-in-all-24-vs-in-all-26.1.txt
15406 out-in-all-24-vs-in-all-26.1.txt

andychu demo$ head -n3 out-in-all-24-vs-in-all-26.1.txt
(  51) linux-2.4.35.3/include/asm-m68k/apollodma.h
(  51) linux-2.6.23.1/include/asm-m68k/apollodma.h   51 1.000

(  94) linux-2.4.35.3/fs/jfs/jfs_btree.h
(  94) linux-2.6.23.1/fs/jfs/jfs_btree.h   94 1.000

(  21) linux-2.4.35.3/Documentation/fb/00-INDEX
(  21) linux-2.6.23.1/Documentation/fb/00-INDEX   21 1.000
...

And here is my explanation from an earlier mail, with some slight edits:

So the algorithm is:

1) Make an inverted index of the left set and right set.  That is {
line => ["postings list", i.e. the list of files the line appears in]
}.  To get rid of common lines like "} else {" or "#endif", there is
an arbitrary "line threshold".

2) Combine the 2 indices into a (sparse) rectangular matrix.  For each
line, iterate over all pairs in the postings list, and increment the
cell in the matrix for that pair by 1.  The index is extremely
shallow, since nearly all lines of code are unique.  The common case
is that the postings list is of length 1.  And the line threshold caps
the length of the postings list.

In the code the matrix is represented by a hash of filename pairs to
integer counts.  So then the count is the number of lines that the 2
files have in common.

3) Compute the similarity metric, which I've defined here as
max(c/|left file|, c/|right file|), where c is the entry in the matrix
for the file pair.  Note that the files are treated as *sets* of lines
(unordered, unique).  The similarity score is a number between 0.0 and
1.0.  Other similarity metrics are certainly possible.
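
A worked example of the metric in step 3, as a Python 3 sketch (the
function name and the guard for empty files are additions for
illustration; sizes are counted in unique lines, as in the text):

```python
def similarity(common, left_size, right_size):
    """max(c/|left file|, c/|right file|), with sizes in unique lines."""
    if left_size == 0 or right_size == 0:
        return 0.0  # assumed convention for empty files
    c = float(common)
    return max(c / left_size, c / right_size)


# A 1000-line file fully contained in a 10000-line file
contained = similarity(1000, 1000, 10000)
# Two 100-line files that share 25 lines
partial = similarity(25, 100, 100)
```

Because the larger of the two ratios wins, `contained` is 1.0 even
though the right-hand file is ten times bigger, while `partial` is 0.25.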

A few things to notice about this algorithm:

* It takes advantage of the fact that code edits are typically
line-oriented, and nearly every line of code is unique.  (This same
technique can be used for arbitrary documents like English text, but
it's a bit harder since you basically have to find a way to make a
"deep" index of words shallow, to speed it up.  For code, the index of
lines is already shallow.)

* Some people might be concerned that it treats files as unordered
sets of lines.  The first thought might be to do this as a
preprocessing step to cull the list of candidates, and then do a real
delta.  But in my experience, I haven't encountered a case where
there's all that much to gain by doing that.

* The line_threshold might appear to be a hack, but it actually
*improves* accuracy.  If you think about it, lines like "#endif"
should not contribute to the similarity between 2 files.  It also
makes it impossible to construct pathological blow-up cases.  If you
have 1000 files on the left and 1000 files on the right that are all
identical to each other, then every line will get pruned, and thus the
entire similarity matrix will be 0, which is arguably what you want.
There is a -l flag to the script to experiment with this threshold.

* This can be run on all files, not just adds/deletes.  If I have a
change of only edits, it could be the case that I moved 500 lines from
one 1000-line file to another 1000-line file, and changed 75
lines within the 500.  It would be meaningful to see a diff in this
case, so that I can see those 75 edits (and a great feature!)

* The similarity is defined that way so that if one file is completely
contained in another, the similarity is 1.0.  So if I move a 1000 line
file and add 9,000 lines to it, the similarity for the file pair will
still be 1.0.  I believe this is a feature, like the point above.

* I don't have a C implementation but I believe the constant factors
should be very low.  You could use the same CRC you were talking about
to reduce memory in storing the lines.  It seems like this algorithm
is amenable to trading memory for speed, as you mention.  Since it is
raw string manipulation, C should be at least 10x faster than Python,
and I wouldn't be surprised if an optimized implementation is 50 or
100x faster.
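
The pathological case from the line_threshold bullet can be checked
with a short Python 3 sketch (file names and contents are invented; the
index helper is a simplified stand-in for the attached script with a
threshold of 1):

```python
from collections import defaultdict


def index_once_only(files):
    """Inverted index keeping only lines that occur in exactly one file."""
    postings = defaultdict(set)
    for name, lines in files.items():
        for line in lines:
            postings[line].add(name)
    return {line: next(iter(names)) for line, names in postings.items()
            if len(names) == 1}


body = ["#include <stdio.h>", "int main(void) { return 0; }"]
left = {"left/%04d.c" % i: body for i in range(1000)}
right = {"right/%04d.c" % i: body for i in range(1000)}

left_index = index_once_only(left)
right_index = index_once_only(right)

# Every line occurs in 1000 files on each side, so both indices are empty
# and no cell of the 1000 x 1000 matrix is ever touched.
matrix = {}
for line, left_file in left_index.items():
    if line in right_index:
        key = (left_file, right_index[line])
        matrix[key] = matrix.get(key, 0) + 1
```

The work done is proportional to the number of lines read, not to the
million possible file pairs.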

...

If anything about the explanation is unclear, let me know and I will
try to clarify it.  Playing around with the demo should illuminate
what it does.  You can run it on data sets of your own.  All you need
is 2 source trees and the "find" command to produce input to the
script (see setup_demo.sh).
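
The contents of setup_demo.sh are not reproduced here, so the following
Python 3 sketch is only an assumption about what the input needs to
look like: the demo script reads two text files with one path per line,
which is roughly what "find <tree> -type f" would print. The function
names are hypothetical.

```python
import os


def list_files(tree):
    """Return the paths of all regular files under `tree`, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                paths.append(path)
    return sorted(paths)


def write_file_list(tree, out_path):
    """Write one path per line and return how many paths were written."""
    paths = list_files(tree)
    with open(out_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)
```

Running write_file_list() once per source tree gives the two list files
to pass as the left and right arguments.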

As mentioned, I will try to do some tests on this, perhaps with Jeff's
hard data set, and show that the results are good and that the
algorithm is faster because the quadratic behavior is gone (if Linus
doesn't beat me to it!).

thanks,
Andy

[-- Attachment #2: setup_demo.sh --]
[-- Type: application/x-sh, Size: 983 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: similarity_demo.py --]
[-- Type: text/x-python; name=similarity_demo.py, Size: 7823 bytes --]

#!/usr/bin/python2.4

"""
Demo of linear-time code similarity calculation.

Author: Andy Chu (andychu at google dot com)
"""

import optparse
import os
import md5
import re
import sys
import time
from pprint import pprint

class MultiTimer:
  """To give a general idea of how long each step takes."""

  def __init__(self, loud=False):
    self.timestamps = []
    self.descriptions = []

  def checkpoint(self, description=''):
    print description
    self.timestamps.append(time.time())
    self.descriptions.append(description)

  def report(self):
    print "------"
    print "Report"
    print "------"

    # Compute time differences
    format = "%-60s %.3fs"
    for i in xrange(len(self.timestamps)-1):
      interval = self.timestamps[i+1] - self.timestamps[i]
      print format % (self.descriptions[i], interval)
    print format % ('Total time', self.timestamps[-1] - self.timestamps[0])

def IndexFiles(files, line_threshold):
  """
  Given a list of filenames, produce an inverted index.

  Any lines which occur in more than "line_threshold" files are thrown out.

  Also, for similarity scoring, keep track of the number of important lines in
  each file.  Files are treated as (unordered) *sets* of (unique) lines.
  Unimportant lines are duplicates and lines which exceed line_threshold.
  """
  # { line -> list of filenames the line appears in }
  index = {}  
  # { filename -> number of unique lines in the file }
  sizes = {}

  for filename in files:
    try:
      f = file(filename)
    except IOError, e:
      print e, filename
      sys.exit(1)

    for line in f:
      line = line.strip()  # Could remove *all* whitespace here
      if line:  # Skip blank lines
        if line in index:
          filelist = index[line]
          # Stop keeping track once we reach the threshold
          if len(filelist) == line_threshold + 1:
            continue
          # Only count the first occurrence of the line in the file
          if filelist[-1] != filename:
            filelist.append(filename)
            sizes[filename] = sizes.get(filename, 0) + 1
        else:
          index[line] = [filename]
          sizes[filename] = sizes.get(filename, 0) + 1
    f.close()

  # Now remove any lines that hit the threshold from the index, and adjust the
  # file sizes.
  to_remove = []
  for line, filelist in index.iteritems():
    if len(filelist) == line_threshold + 1:
      to_remove.append(line)
      for f in filelist:
        sizes[f] -= 1
  for line in to_remove:
    del index[line]

  return index, sizes

def FindSimilarPairs(set1, set2, reportfile, line_threshold):
  """Calculates pairwise similarity between two sets of files, using no other
  information but the contents of the files (i.e. no filename information).

  Args:
    set1, set2: The 2 lists of file system paths to compare.
    reportfile: name of the output text file
    line_threshold:
      We prune the index of entries that occur more than this number of times.
      For example, the line "} else {" may occur very frequently in the code,
      but we just throw it out, since the fact that 2 files have this line in
      common is not very meaningful.

      Making line_threshold a constant also makes the algorithm linear.  In
      practice, with real data sets, the results should be quite stable with
      respect to this parameter.  That is, they should not vary very much at
      all if line_threshold = 1 or 100 or 1000.
  """
  mt = MultiTimer()

  print '%s * %s = %s total possible pairs of files' % (
      len(set1), len(set2), len(set1) * len(set2))

  #
  # 1. Generate inverted indices and size information for each set.
  #
  mt.checkpoint("Indexing the first set of %d files (threshold=%d)" % (
    len(set1), line_threshold))
  index1, sizes1 = IndexFiles(set1, line_threshold)

  mt.checkpoint("Indexing the second set of %d files (threshold=%d)" % (
    len(set2), line_threshold))
  index2, sizes2 = IndexFiles(set2, line_threshold)

  print "Sum of file sizes in first set: %s lines" % sum(sizes1.values())
  print "Sum of file sizes in second set: %s lines" % sum(sizes2.values())
  print "Size of index over first set:", len(index1)
  print "Size of index over second set:", len(index2)

  #
  # 2. Combine the 2 indices to form sparse matrix that counts common lines.
  #
  # Pairs which do not appear have an implicit count of 0.
  #
  # This is a sparse matrix represented as a dictionary of 2-tuples:
  # (filename in set1, filename in set2) -> number of (unique) lines that
  # they have in common

  mt.checkpoint("Computing union of lines in the indices")
  all_lines = set(index1) | set(index2)
  print 'Total unique lines in both indices: %s' % len(all_lines)

  mt.checkpoint("Making sparse common line matrix")

  matrix = {}
  for line in all_lines:
    files1 = index1.get(line)
    files2 = index2.get(line)
    if files1 and files2:
      # For every pair of files that contain this line, increment the
      # corresponding cell in the common line matrix.  
      #
      # Since we pruned the index, this whole double loop is constant time.  If
      # the line_threshold is 1 (default), then the whole thing is just a
      # single iteration.  
      for f1 in files1:
        for f2 in files2:
          matrix[(f1, f2)] = matrix.get((f1, f2), 0) + 1

  mt.checkpoint("Calculating similarity for %s pairs of files" % len(matrix))

  #
  # 3. Calculate the similarity of each cell in the matrix.
  #
  # The similarity metric is number of common lines divided by the file size
  # (and take the greater of the 2).  Note that this means that if file1 is
  # entirely *contained* in file2, then the similarity of the pair will be 1.0.
  # This is desirable since you can detect if 99,000 lines were added to a 1,000
  # line file, etc.  Other similarity metrics are possible.
  #
  similar_pairs = []
  for (file1, file2), num_common in matrix.iteritems():
    c = float(num_common)
    # TODO: Write notes on similarity metric
    similarity = max(c / sizes1[file1], c / sizes2[file2])
    similar_pairs.append((file1, file2, num_common, similarity))

  mt.checkpoint("Sorting %d similar pairs" % len(similar_pairs))
  similar_pairs.sort(key=lambda x: x[3], reverse=True)

  mt.checkpoint("Writing similarity report")
  f = open(reportfile, 'w')
  for file1, file2, num_common, similarity in similar_pairs:
    # Put a * after entries where the relative paths are *not* the same, just
    # for convenience when looking at the report.
    if file1.split('/')[1:] != file2.split('/')[1:]:
      mark = '*'
    else:
      mark = ''

    print >> f, '(%4d) %-40s (%4d) %-40s %4d %.3f %s' % (
        sizes1[file1], file1, sizes2[file2], file2, num_common, similarity,
        mark)
  f.close()
  print 'Wrote %s' % reportfile

  mt.checkpoint("End")
  mt.report()

if __name__ == '__main__':
  parser = optparse.OptionParser()

  parser.add_option(
      '-l', '--line-threshold', dest='line_threshold', type='int', default=1,
      help='ignore lines that occur more than this number of times in either '
      'set')
  parser.add_option(
      '-o', '--out', dest='reportfile', default=None,
      help='Write output to this file instead of the default.')

  (options, argv) = parser.parse_args()

  try:
    left, right = argv[:2]
  except ValueError:
    print 'Pass 2 files containing paths as arguments (left tree, right tree).'
    sys.exit(1)

  if not options.reportfile:
    options.reportfile = 'out-%s-vs-%s.%s.txt' % (
        os.path.splitext(os.path.basename(left))[0],
        os.path.splitext(os.path.basename(right))[0],
        options.line_threshold)

  set1 = [f.strip() for f in open(left).readlines()]
  set2 = [f.strip() for f in open(right).readlines()]
  FindSimilarPairs(set1, set2, options.reportfile, options.line_threshold)

* Re: [PATCH] Add some fancy colors in the test library when terminal supports it.
From: Johannes Sixt @ 2007-10-22  8:53 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Shawn O. Pearce, git
In-Reply-To: <20071022081341.GC32763@artemis.corp>

Pierre Habouzit schrieb:
> Signed-off-by: Pierre Habouzit <madcoder@debian.org>
> ---
> 
> Maybe this is just me, but I don't find the output of the test-suite
> easy to watch while scrolling. This puts some colors in proper places.
> 
>   * end-test summaries are in green or red depending on the success of
>     the tests.
>   * errors are in red.
>   * skipped tests and other things that tests `say` are in brown (now
>     you can _see_ that your testsuite skips some tests on purpose, I
>     only noticed recently that I missed part of the environment for
>     proper testing).
> 
> I'm not 100% sure the test to see if terminal supports color is correct, and
> people using emacs shell buffer or similar tools may have better ideas on how to
> make it.
> 
> and yes, I know that it "depends" upon tput, but if tput isn't available, the
>     [ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
> expression will fail, and color will be disabled.
> 
>  t/test-lib.sh |   32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index cc1253c..c6521c0 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -59,14 +59,24 @@ esac
>  # '
>  # . ./test-lib.sh
>  
> +[ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
> +nocolor=$?

test "x$TERM" != "xdumb" &&
	tput hpa 60 >/dev/null 2>&1 &&
	tput setaf 1 >/dev/null 2>&1 &&
	color=t

BTW, doesn't tput fail if stdout/stderr is not a terminal, like above?

> +
> +say_color () {
> +	[ "$nocolor" = 0 ] &&  [ "$1" != '-1' ] && tput setaf "$1"
> +	shift
> +	echo "* $*"
> +	tput op
> +}

What if tput is not available, like on Windows? How about this (at the end 
of the file, so it can obey --no-color):

if test "$color"; then
	say_color () {
		test "$1" != '-1' && tput setaf "$1"
		shift
		echo "* $*"
		tput op
	}
else
	say_color() {
		shift
		echo "* $*"
	}
fi

> +	--no-color)
> +	    nocolor=1; shift ;;

	    color=; shift ;;

-- Hannes "We don't need no double negation"
