* Re: Linear time/space rename logic for *inexact* case
From: Jeff King @ 2007-10-22 10:34 UTC (permalink / raw)
To: Andy C; +Cc: git
In-Reply-To: <596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com>
On Mon, Oct 22, 2007 at 02:40:08AM -0700, Andy C wrote:
> I just subscribed to this list, so sorry I can't respond to the
> threads already started here. I'm the guy that was mailing Linus
> about this algorithm to do similarity detection in linear time,
> mentioned here:
Great, welcome aboard. :)
> To avoid the m*n memory usage of step 2, I use a hash table which maps
> 2-tuples to counts to represent the sparse similarity matrix, instead
> of representing it directly. The 2-tuple is a pair of filenames,
> corresponding to the row/column of the matrix, and the counts/values
> are the entries in the matrix.
OK, makes sense (that's what I was trying to propose near the end of my
mail).
> You can also prune entries which have a long "postings list" (using
> the term in the IR sense here:
> http://www.xapian.org/docs/intro_ir.html). This has the nice side
> effect of getting rid of quadratic behavior, *and* making the
> algorithm more accurate because it stops considering common lines like
> "#endif" as contributing to similarity.
Ah, very clever. I think that should have nice behavior most of the
time, though I wonder about a pathological case where we have many
files, all very similar to each other, and then have a commit where they
all start to diverge, but just by a small amount, while getting moved.
We would end up with an artificially low score between renamed files
(because we've thrown out all of the commonality) which may lead us to
believe there was no rename at all.
But it might be worth ignoring that case.
> This pruning of common lines from the index makes step 3 linear. In
> fact, I prune the index to only include lines which occur just *once*.
> Nearly every line of code in real data sets is unique, so this works
> well in practice.
Makes sense.
> http://marc.info/?l=git&m=118975452330644&w=2
> "Of course, 2 minutes for a git-status is still way too slow, so there we
> might still need a limiter. It also looks like 57% of my time is spent
> in spanhash_find, and another 29% in diffcore_count_changes."
>
> I know there have been a couple fixes since you posted that, but if it
> was the O(m*n) behavior that was causing your problem, it should still
> be reproducible. Linus suggested that this is a good data set to try
> things out with. Is there there a repository I can clone with git
> commands to run to repro this?
Yes, I still have the problem (the 2 minutes was _after_ we did fixes,
down from 20 minutes; the latest patch from Linus just covers the "exact
match" case where two files have identical SHA1s).
It's a 1.2G repo at the end of a slow DSL line, so rather than cloning
it, here's a way to reproduce a repo with similar properties:
-- >8 --
#!/bin/sh
rm -rf repo
mkdir repo && cd repo
# seq and openssl aren't portable, but the
# point is to generate 200 random 1M files
for i in `seq -f %03g 1 600`; do
openssl rand 100000 >$i.rand
done
# make repo, fully packed
# we don't bother trying to delta in the pack
# since the files are all random
git-init
git-add .
git-commit -q -m one
git-repack -a -d --window=0
# move every file
mkdir new
git-mv *.rand new
# modify every file
for i in new/*.rand; do
echo foo >>$i
done
git-add new
# this is the operation of interest
time git-diff --cached --raw -M -l0 >/dev/null
-- >8 --
The idea is to have a large number of files that are slightly changed
and moved, and to try to find the pairs. The diff takes about 20
seconds to run for me (the real repo has 1M files rather than 100K
files, but it's nice to have the tests take a lot less time). If you
want a bigger test, bump up the file size (or increase the number of
files, which will emphasize the quadratic behavior).
> 3) Compute the similarity metric, which I've defined here as
> max(c/|left file|, c/|right file|), where c the entry in the matrix
> for the file pair. Note that the files are treated as *sets* of lines
> (unordered, unique). The similarity score is a number between 0.0 and
> 1.0. Other similarity metrics are certainly possible.
We have to handle binary files, too. In the current implementation, we
consider either lines or "chunks", and similarity is increased by the
size of the chunk.
> * Some people might be concerned that it treats files as unordered
> sets of lines. The first thought might be to do this as a
> preprocessing step to cull the list of candidates, and then do a real
> delta. But in my experience, I haven't encountered a case where
> there's all that much to gain by doing that.
I think we are already treating the files as unordered sets of lines.
And really, I think there is some value in that, anyway. If I reverse
the order of all lines in a file, it might be useful for git to say
"this file came from that file".
> * This can be run on all files, not just adds/deletes. If I have a
Great. git has a "look for copies also" flag, but it is usually disabled
because of the computational cost. If we can get it low enough, it might
actually become a lot more useful.
> If anything about the explanation is unclear, let me know and I will
> try to clarify it. Playing around with the demo should illuminate
> what it does. You can run it on data sets of your own. All you need
> is 2 source trees and the "find" command to produce input to the
> script (see setup_demo.sh).
I'll try it on my test data, but it sounds like it doesn't really handle
binary files.
> As mentioned, I will try to do some tests on this, perhaps with Jeff's
> hard data set, and show that the results are good and that the
> algorithm is faster because the quadratic behavior is gone (if Linus
> doesn't beat me to it!).
Trying to fit it into the C git code would be useful, but I doubt I'll
have time to work on it tonight, since it's getting onto dawn here.
-Peff
^ permalink raw reply
* git filter-branch --subdirectory-filter error
From: Jan Wielemaker @ 2007-10-22 10:27 UTC (permalink / raw)
To: git
Hi,
Finished a big re-shuffle of a big project, while other developers
continued. Worked really well. Thanks guys! But now I have two top
directories and I want to create two new repositories, each containing
one of these directories (because the one holds copyrighted data and we
want the other to become public software). So, I happily run
$ git filter-branch --subdirectory-filter RDF HEAD
Where RDF is an existing directory. I get:
Rewrite 95807fe01c39d3092e3ac3a98061711323154d77 (1/12)fatal: Not a valid
object name 95807fe01c39d3092e3ac3a98061711323154d77:RDF
Could not initialize the index
I tried the procedure on some smaller test projects and it all worked
just fine. Running git version 1.5.3.4 on SuSE Linux. Also ran "git fsck
--full", which completed without any message.
Git show says:
gollem (eculture) 121_> git show 95807fe01c39d3092e3ac3a98061711323154d77 |
cat
commit 95807fe01c39d3092e3ac3a98061711323154d77
Merge: 76d2935... 58afb98...
Author: Jan Wielemaker <wielemak@science.uva.nl>
Date: Thu Oct 18 17:32:22 2007 +0200
Merge branch 'master' of /home/eculture/eculture
Any clue?
Thanks --- Jan
^ permalink raw reply
* Re: .gittattributes handling has deficiencies
From: Johannes Schindelin @ 2007-10-22 10:29 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: Shawn O. Pearce, david, git
In-Reply-To: <565E1D52-59C4-4EB8-AA81-FF94F346FE61@zib.de>
Hi,
On Mon, 22 Oct 2007, Steffen Prohaska wrote:
> .gitattributes is first looked for in the working directory, and if not
> there, .gitattributes is read from the index.
Of course we could change that to do it the other way round. But this
would contradict expectations when you edit .gitattributes and then
checkout single files without having git-add'ed .gitattributes first.
The biggest problem in your setup, however, is not if .gitattributes is
read from the index or the working directory. The biggest problem is that
files are not touched when their contents have not changed.
IOW if you have .gitattributes in the to-be-checked-out branch which say
that README is crlf, and in the current branch it is not, and README's
_contents_ are identical in both branches, a "git checkout
<that-other-branch>" will not rewrite README, and consequently not change
the working copy to crlf.
Ciao,
Dscho
^ permalink raw reply
* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Johannes Schindelin @ 2007-10-22 10:22 UTC (permalink / raw)
To: Andreas Ericsson; +Cc: git, spearce
In-Reply-To: <471C77F0.5050701@op5.se>
Hi,
On Mon, 22 Oct 2007, Andreas Ericsson wrote:
> Johannes Schindelin wrote:
>
> > On Sun, 21 Oct 2007, Andreas Ericsson wrote:
> >
> > > [PATCH 1/1] looks a bit silly, and automagically handling this in
> > > git-format-patch makes some scripting around it a lot more pleasant.
> >
> > I think you should not use "-n" if you do not want to have the
> > numbers.
>
> This stems from creating scripts around it where I only want to see the
> numbers if there is more than a single patch. Currently I can't do that
> without running git-format-patch twice or re-implementing the revision
> parsing machinery to count revisions prior to passing arguments to
> format-patch.
Why not have something as simple as
numbered=
test $(git rev-list $options | wc -l) -gt 1 && numbered=-n
[...]
git format-patch $numbered $options
At the moment, the semantics of "--numbered" is really clear and precise.
And I really like that. It makes for less surprises.
Ciao,
Dscho
^ permalink raw reply
* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Andreas Ericsson @ 2007-10-22 10:14 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: git, spearce
In-Reply-To: <Pine.LNX.4.64.0710221044080.25221@racer.site>
Johannes Schindelin wrote:
> Hi,
>
> On Sun, 21 Oct 2007, Andreas Ericsson wrote:
>
>> [PATCH 1/1] looks a bit silly, and automagically handling this in
>> git-format-patch makes some scripting around it a lot more pleasant.
>
> I think you should not use "-n" if you do not want to have the numbers.
This stems from creating scripts around it where I only want to see the
numbers if there is more than a single patch. Currently I can't do that
without running git-format-patch twice or re-implementing the revision
parsing machinery to count revisions prior to passing arguments to
format-patch.
> In circumstances as yours, where you can have patch series larger than
> one, I imagine that the "[PATCH 1/1]" bears an important information,
> which is not present in "[PATCH]": this patch series contains only one
> patch.
>
That's sort of implicit in [PATCH], since all patch-series added by the
tool I'm scripting will have [PATCH n/m], while I'd prefer for
single-patches to have only [PATCH].
> IOW I do not like your patch: too much DWIDNS (Do What I Did NOT Say) for
> me.
>
Would a separate option be acceptable to you?
It doesn't have to be fancy or short, since I really only mean to use it
from our scripts.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
^ permalink raw reply
* Re: Linear time/space rename logic for *inexact* case
From: Andy C @ 2007-10-22 10:09 UTC (permalink / raw)
To: git
In-Reply-To: <596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com>
On 10/22/07, Andy C <andychup@gmail.com> wrote:
> So the algorithm is:
I think I can make this a lot clearer than I did, while glossing over
some details and the line_threshold parameter.
1) Make a "left index" and a "right index" out of the 2 sets of files,
{ line => [list of docs] }.
2) Remove any lines that appear in more than one doc from the left
index. Do the same for the right index. (this corresponds to
line_threshold=1 case)
3) For all lines, if the line appears in *both* the left index and the
right index, increment the count of the (row=doc from left set,
column=doc from right set) entry in the similarity matrix by 1. The
matrix is represented by a hash of 2-tuples => counts.
After this is done for all lines, then the matrix is sparsely filled
with the count of common lines between every pair of files in the 2
sets. The vast majority of cells in the matrix are implicitly 0 and
thus consume neither memory nor CPU with the hash table representation
of matrix.
4) Then you can use this to compute similarity scores.
Hopefully that is more clear... though I guess it might not be obvious
that it works for the problem that git has. I am fairly sure it does,
but the demo should allow us to evaluate that.
Andy
^ permalink raw reply
* Re: [PATCH] git-format-patch: Don't number patches when there's only one
From: Johannes Schindelin @ 2007-10-22 9:44 UTC (permalink / raw)
To: Andreas Ericsson; +Cc: git, spearce
In-Reply-To: <20071021144637.762085BB92@nox.op5.se>
Hi,
On Sun, 21 Oct 2007, Andreas Ericsson wrote:
> [PATCH 1/1] looks a bit silly, and automagically handling this in
> git-format-patch makes some scripting around it a lot more pleasant.
I think you should not use "-n" if you do not want to have the numbers.
In circumstances as yours, where you can have patch series larger than
one, I imagine that the "[PATCH 1/1]" bears an important information,
which is not present in "[PATCH]": this patch series contains only one
patch.
IOW I do not like your patch: too much DWIDNS (Do What I Did NOT Say) for
me.
Ciao,
Dscho
^ permalink raw reply
* Linear time/space rename logic for *inexact* case
From: Andy C @ 2007-10-22 9:40 UTC (permalink / raw)
To: git
[-- Attachment #1: Type: text/plain, Size: 9134 bytes --]
I just subscribed to this list, so sorry I can't respond to the
threads already started here. I'm the guy that was mailing Linus
about this algorithm to do similarity detection in linear time,
mentioned here:
Subject: [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2
Subject: [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2
Jeff, in this message http://marc.info/?l=git&m=119303566130201&w=2 I
think you basically hit on the first half of what I was getting at.
The step 1 you describe is the first step of the algorithm -- make an
inverted index of lines (and you can use a hash code of the line to
stand in for the line).
To avoid the m*n memory usage of step 2, I use a hash table which maps
2-tuples to counts to represent the sparse similarity matrix, instead
of representing it directly. The 2-tuple is a pair of filenames,
corresponding to the row/column of the matrix, and the counts/values
are the entries in the matrix.
You can also prune entries which have a long "postings list" (using
the term in the IR sense here:
http://www.xapian.org/docs/intro_ir.html). This has the nice side
effect of getting rid of quadratic behavior, *and* making the
algorithm more accurate because it stops considering common lines like
"#endif" as contributing to similarity.
This pruning of common lines from the index makes step 3 linear. In
fact, I prune the index to only include lines which occur just *once*.
Nearly every line of code in real data sets is unique, so this works
well in practice.
I already sent this demo to Linus, but I think it's worth sending to
the list as well. I am just going to copy parts of my earlier e-mails
below, and attach the same demos (hopefully it is kosher to send
attachments to this list).
Before I do that, Jeff, can you still reproduce this problem:
http://marc.info/?l=git&m=118975452330644&w=2
"Of course, 2 minutes for a git-status is still way too slow, so there we
might still need a limiter. It also looks like 57% of my time is spent
in spanhash_find, and another 29% in diffcore_count_changes."
I know there have been a couple fixes since you posted that, but if it
was the O(m*n) behavior that was causing your problem, it should still
be reproducible. Linus suggested that this is a good data set to try
things out with. Is there there a repository I can clone with git
commands to run to repro this?
OK, so attached is a little demo of the algorithm, which is (very
little) Python code, but with comments so non-Python people can
hopefully follow it. Because of this the timings are not very
meaningful, but it proves that the algorithm doesn't blow up.
I ran it on the entire Linux 2.4 vs. Linux 2.6 codebases. It is
*only* considering file content. You can rename every file in both
source trees to completely random strings and it will still match
files up. There is nothing about filenames, or identical files, and
you can consider the whole 2.4 side "all deletes" and the whole 2.6
side "all adds". The size of the matrix would be around 286 million
cells, but here I only represent the non-zero entries in the matrix,
which is only 15,406 cells.
$ wc -l similarity_demo.py
233 similarity_demo.py
$ ./similarity_demo.py in-all-*
12697 * 22530 = 286063410 total possible pairs of files
Indexing the first set of 12697 files (threshold=1)
Indexing the second set of 22530 files (threshold=1)
Sum of file sizes in first set: 2134249 lines
Sum of file sizes in second set: 3424338 lines
Size of index over first set: 2134249
Size of index over second set: 3424338
Computing union of lines in the indices
Total unique lines in both indices: 4384375
Making sparse common line matrix
Calculating similarity for 15406 pairs of files
Sorting 15406 similar pairs
Writing similarity report
Wrote out-in-all-24-vs-in-all-26.1.txt
End
------
Report
------
Indexing the first set of 12697 files (threshold=1) 29.540s
Indexing the second set of 22530 files (threshold=1) 111.041s
Computing union of lines in the indices 7.450s
Making sparse common line matrix 13.468s
Calculating similarity for 15406 pairs of files 0.055s
Sorting 15406 similar pairs 0.030s
Writing similarity report 0.249s
Total time 161.834s
The script outputs a text file with 15,406 similar pairs, in order of
similarity (1.0's are at the top):
andychu demo$ wc -l out-in-all-24-vs-in-all-26.1.txt
15406 out-in-all-24-vs-in-all-26.1.txt
andychu demo$ head -n3 out-in-all-24-vs-in-all-26.1.txt
( 51) linux-2.4.35.3/include/asm-m68k/apollodma.h
( 51) linux-2.6.23.1/include/asm-m68k/apollodma.h 51 1.000
( 94) linux-2.4.35.3/fs/jfs/jfs_btree.h
( 94) linux-2.6.23.1/fs/jfs/jfs_btree.h 94 1.000
( 21) linux-2.4.35.3/Documentation/fb/00-INDEX
( 21) linux-2.6.23.1/Documentation/fb/00-INDEX 21 1.000
...
And here is my explanation from an earlier mail, with some slight edits:
So the algorithm is:
1) Make an inverted index of the left set and right set. That is {
line => ["postings list", i.e. the list of files the line appears in]
}. To get rid of common lines like "} else {" or "#endif", there is
an arbitrary "line threshold".
2) Combine the 2 indices into a (sparse) rectangular matrix. For each
line, iterate over all pairs in the postings list, and increment the
cell in the matrix for that pair by 1. The index is extremely
shallow, since nearly all lines of code are unique. The common case
is that the postings list is of length 1. And the line threshold caps
the length of the postings list.
In the code the matrix represented by a hash of filename pairs to
integer counts. So then the count is the number of lines that the 2
files have in common.
3) Compute the similarity metric, which I've defined here as
max(c/|left file|, c/|right file|), where c the entry in the matrix
for the file pair. Note that the files are treated as *sets* of lines
(unordered, unique). The similarity score is a number between 0.0 and
1.0. Other similarity metrics are certainly possible.
A few things to notice about this algorithm:
* It takes advantage of the fact that code edits are typically
line-oriented, and nearly every line of code is unique. (This same
technique can be used for arbitrary documents like English text, but
it's a bit harder since you basically have to find a way to make a
"deep" index of words shallow, to speed it up. For code, the index of
lines is already shallow.)
* Some people might be concerned that it treats files as unordered
sets of lines. The first thought might be to do this as a
preprocessing step to cull the list of candidates, and then do a real
delta. But in my experience, I haven't encountered a case where
there's all that much to gain by doing that.
* The line_threshold might appear to be a hack, but it actually
*improves* accuracy. If you think about it, lines like "#endif"
should not contribute to the similarity between 2 files. It also
makes it impossible to construct pathological blow-up cases. If you
have 1000 files on the left and 1000 files on the right that are all
identical to each other, then every line will get pruned, and thus the
entire similarity matrix will be 0, which is arguably what you want.
There is a -l flag to the script to experiment with this threshold.
* This can be run on all files, not just adds/deletes. If I have a
change of only edits, it could be the case that I moved 500 lines from
one file 1000 line file to another 1000 line file, and changed 75
lines within the 500. It would be meaningful to see a diff in this
case, so that I can see those 75 edits (and a great feature!)
* The similarity is defined that way so that if one file is completely
contained in another, the similarity is 1.0. So if I move a 1000 line
file and add 9,000 lines to it, the similarity for the file pair will
still be 1.0. I believe this is a feature, like the point above.
* I don't have a C implementation but I believe the constant factors
should be very low. You could use the same CRC you were talking about
to reduce memory in storing the lines. It seems like this algorithm
is amenable to trading memory for speed, as you mention. Since it is
raw string manipulation, C should be at least 10x faster than Python,
and I wouldn't be surprised if an optimized implementation is 50 or
100x faster.
...
If anything about the explanation is unclear, let me know and I will
try to clarify it. Playing around with the demo should illuminate
what it does. You can run it on data sets of your own. All you need
is 2 source trees and the "find" command to produce input to the
script (see setup_demo.sh).
As mentioned, I will try to do some tests on this, perhaps with Jeff's
hard data set, and show that the results are good and that the
algorithm is faster because the quadratic behavior is gone (if Linus
doesn't beat me to it!).
thanks,
Andy
[-- Attachment #2: setup_demo.sh --]
[-- Type: application/x-sh, Size: 983 bytes --]
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: similarity_demo.py --]
[-- Type: text/x-python; name=similarity_demo.py, Size: 7823 bytes --]
#!/usr/bin/python2.4
"""
Demo of linear-time code similarity calculation.
Author: Andy Chu (andychu at google dot com)
"""
import optparse
import os
import md5
import re
import sys
import time
from pprint import pprint
class MultiTimer:
"""To give a general idea of how long each step takes."""
def __init__(self, loud=False):
self.timestamps = []
self.descriptions = []
def checkpoint(self, description=''):
print description
self.timestamps.append(time.time())
self.descriptions.append(description)
def report(self):
print "------"
print "Report"
print "------"
# Compute time differences
format = "%-60s %.3fs"
for i in xrange(len(self.timestamps)-1):
interval = self.timestamps[i+1] - self.timestamps[i]
print format % (self.descriptions[i], interval)
print format % ('Total time', self.timestamps[-1] - self.timestamps[0])
def IndexFiles(files, line_threshold):
"""
Given a list of filenames, produce an inverted index.
Any lines which occur in "line_threshold" files are thrown out.
Also, for similarity scoring, keep track of the number of important lines in
each file. Files are treated as (unordered) *sets* of (unique) lines.
Unimportant lines are duplicates and lines which exceed line_threshold.
"""
# { line -> list of filenames the line appears in }
index = {}
# { filename -> number of unique lines in the file }
sizes = {}
for filename in files:
try:
f = file(filename)
except IOError, e:
print e, filename
sys.exit(1)
for line in f:
line = line.strip() # Could remove *all* whitespace here
if line: # Skip blank lines
if line in index:
filelist = index[line]
# Stop keeping track once we reach the threshold
if len(filelist) == line_threshold + 1:
continue
# Only count the first occurrence of the line in the file
if filelist[-1] != filename:
filelist.append(filename)
sizes[filename] = sizes.get(filename, 0) + 1
else:
index[line] = [filename]
sizes[filename] = sizes.get(filename, 0) + 1
f.close()
# Now remove any lines that hit the threshold from the index, and adjust the
# file sizes.
to_remove = []
for line, filelist in index.iteritems():
if len(filelist) == line_threshold + 1:
to_remove.append(line)
for f in filelist:
sizes[f] -= 1
for line in to_remove:
del index[line]
return index, sizes
def FindSimilarPairs(set1, set2, reportfile, line_threshold):
"""Calculates pairwise similarity between two sets of files, using no other
information but the contents of the files (i.e. no filename information).
Args:
set1, set2: The 2 lists of file system paths to compare.
reportfile: name of the output text file
line_threshold:
We prune the index of entries that occur more than this number of times.
For example, the line "} else {" may occur very frequently in the code,
but we just throw it out, since the fact that 2 files have this line in
common is not very meaningful.
Making line_threshold a constant also makes the algorithm linear. In
practice, with real data sets, the results should be quite stable with
respect to this parameter. That is, they should not vary very much at
all if line_threshold = 1 or 100 or 1000.
"""
mt = MultiTimer()
print '%s * %s = %s total possible pairs of files' % (
len(set1), len(set2), len(set1) * len(set2))
#
# 1. Generate inverted indices and size information for each set.
#
mt.checkpoint("Indexing the first set of %d files (threshold=%d)" % (
len(set1), line_threshold))
index1, sizes1 = IndexFiles(set1, line_threshold)
mt.checkpoint("Indexing the second set of %d files (threshold=%d)" % (
len(set2), line_threshold))
index2, sizes2 = IndexFiles(set2, line_threshold)
print "Sum of file sizes in first set: %s lines" % sum(sizes1.values())
print "Sum of file sizes in second set: %s lines" % sum(sizes2.values())
print "Size of index over first set:", len(index1)
print "Size of index over second set:", len(index2)
#
# 2. Combine the 2 indices to form sparse matrix that counts common lines.
#
# Pairs which do not appear have an implicit count of 0.
#
# This is a sparse matrix represented as a dictionary of 2-tuples:
# (filename # in set1, filename in set2) -> number of (unique) lines that
# they have in common
mt.checkpoint("Computing union of lines in the indices")
all_lines = set(index1) | set(index2)
print 'Total unique lines in both indices: %s' % len(all_lines)
mt.checkpoint("Making sparse common line matrix")
matrix = {}
for line in all_lines:
files1 = index1.get(line)
files2 = index2.get(line)
if files1 and files2:
# For every pair of files that contain this line, increment the
# corresponding cell in the common line matrix.
#
# Since we pruned the index, this whole double loop is constant time. If
# the line_threshold is 1 (default), then the whole thing is just a
# single iteration.
for f1 in files1:
for f2 in files2:
matrix[(f1, f2)] = matrix.get((f1, f2), 0) + 1
mt.checkpoint("Calculating similarity for %s pairs of files" % len(matrix))
#
# 3. Calculate the similarity of each cell in the matrix.
#
# The similarity metric is number of common lines divided by the file size
# (and take the greater of the 2). Note that this means that if file1 is
# entirely *contained* in file2, then the similarity of the pair will be 1.0.
# This is desirable since you can detect if 99,000 lines were added to a 1,000
# line file, etc. Other similarity metrics are possible.
#
similar_pairs = []
for (file1, file2), num_common in matrix.iteritems():
c = float(num_common)
# TODO: Write notes on similarity metric
similarity = max(c / sizes1[file1], c / sizes2[file2])
similar_pairs.append((file1, file2, num_common, similarity))
mt.checkpoint("Sorting %d similar pairs" % len(similar_pairs))
similar_pairs.sort(key=lambda x: x[3], reverse=True)
mt.checkpoint("Writing similarity report")
f = open(reportfile, 'w')
for file1, file2, num_common, similarity in similar_pairs:
# Put a * after entries where the relative paths are *not* the same, just
# for convenience when looking at the report.
if file1.split('/')[1:] != file2.split('/')[1:]:
mark = '*'
else:
mark = ''
print >> f, '(%4d) %-40s (%4d) %-40s %4d %.3f %s' % (
sizes1[file1], file1, sizes2[file2], file2, num_common, similarity,
mark)
f.close()
print 'Wrote %s' % reportfile
mt.checkpoint("End")
mt.report()
if __name__ == '__main__':
parser = optparse.OptionParser()
parser.add_option(
'-l', '--line-threshold', dest='line_threshold', type='int', default=1,
help='ignore lines that occur more than this number of times in either '
'set')
parser.add_option(
'-o', '--out', dest='reportfile', default=None,
help='Write output to this file instead of the default.')
(options, argv) = parser.parse_args()
try:
left, right = argv[:2]
except ValueError:
print 'Pass 2 files containing paths as arguments (left tree, right tree).'
sys.exit(1)
if not options.reportfile:
options.reportfile = 'out-%s-vs-%s.%s.txt' % (
os.path.splitext(os.path.basename(left))[0],
os.path.splitext(os.path.basename(right))[0],
options.line_threshold)
set1 = [f.strip() for f in open(left).readlines()]
set2 = [f.strip() for f in open(right).readlines()]
FindSimilarPairs(set1, set2, options.reportfile, options.line_threshold)
^ permalink raw reply
* Re: [PATCH] Add some fancy colors in the test library when terminal supports it.
From: Johannes Sixt @ 2007-10-22 8:53 UTC (permalink / raw)
To: Pierre Habouzit; +Cc: Shawn O. Pearce, git
In-Reply-To: <20071022081341.GC32763@artemis.corp>
Pierre Habouzit schrieb:
> Signed-off-by: Pierre Habouzit <madcoder@debian.org>
> ---
>
> Maybe this is just me, but I don't find the output of the test-suite
> easy to watch while scrolling. This puts some colors in proper places.
>
> * end-test summaries are in green or red depending on the sucess of
> the tests.
> * errors are in red.
> * skipped tests and other things that tests `say` are in brown (now
> you can _see_ that your testsuite skips some tests on purpose, I
> only noticed recently that I missed part of the environment for
> proper testing).
>
> I'm not 100% sure the test to see if terminal supports color is correct, and
> people using emacs shell buffer or alike tools may have better ideas on how to
> make it.
>
> and yes, I know that it "depends" upon tput, but if tput isn't available, the
> [ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
> expression will fail, and color will be disabled.
>
> t/test-lib.sh | 32 ++++++++++++++++++++++----------
> 1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index cc1253c..c6521c0 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -59,14 +59,24 @@ esac
> # '
> # . ./test-lib.sh
>
> +[ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
> +nocolor=$?
test "x$TERM" != "xdumb" &&
tput hpa 60 >/dev/null 2>&1 &&
tput setaf 1 >/dev/null 2>&1 &&
color=t
BTW, doesn't tput fail if stdout/stderr is not a terminal, like above?
> +
> +say_color () {
> + [ "$nocolor" = 0 ] && [ "$1" != '-1' ] && tput setaf "$1"
> + shift
> + echo "* $*"
> + tput op
> +}
What if tput is not available, like on Windows? How about this (at the end
of the file, so it can obey --no-color):
if test "$color"; then
say_color () {
test "$1" != '-1' && tput setaf "$1"
shift
echo "* $*"
tput op
}
else
say_color() {
shift
echo "* $*"
}
fi
> + --no-color)
> + nocolor=1; shift ;;
color=; shift ;;
-- Hannes "We don't need no double negation"
^ permalink raw reply
* [PATCH] s/pattern/prefix/ in help's list_commands
From: Scott R Parish @ 2007-10-22 8:51 UTC (permalink / raw)
To: git
list_commands() currently accepts and ignores a "pattern" argument,
and then hard codes a prefix as well as some magic numbers. This
renames the arg from pattern to prefix and uses that instead of the
hardcoded stuff.
Signed-off-by: Scott R Parish <srp@srparish.net>
---
help.c | 13 +++++++------
1 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/help.c b/help.c
index b0d2dd4..950f62d 100644
--- a/help.c
+++ b/help.c
@@ -93,11 +93,12 @@ static void pretty_print_string_list(struct cmdname **cmdname, int longest)
}
}
-static void list_commands(const char *exec_path, const char *pattern)
+static void list_commands(const char *exec_path, const char *prefix)
{
unsigned int longest = 0;
char path[PATH_MAX];
int dirlen;
+ int prefix_len = strlen(prefix);
DIR *dir = opendir(exec_path);
struct dirent *de;
@@ -120,7 +121,7 @@ static void list_commands(const char *exec_path, const char *pattern)
struct stat st;
int entlen;
- if (prefixcmp(de->d_name, "git-"))
+ if (prefixcmp(de->d_name, prefix))
continue;
strcpy(path+dirlen, de->d_name);
if (stat(path, &st) || /* stat, not lstat */
@@ -128,14 +129,14 @@ static void list_commands(const char *exec_path, const char *pattern)
!(st.st_mode & S_IXUSR))
continue;
- entlen = strlen(de->d_name);
+ entlen = strlen(de->d_name) - prefix_len;
if (has_extension(de->d_name, ".exe"))
entlen -= 4;
if (longest < entlen)
longest = entlen;
- add_cmdname(de->d_name + 4, entlen-4);
+ add_cmdname(de->d_name + prefix_len, entlen);
}
closedir(dir);
@@ -143,7 +144,7 @@ static void list_commands(const char *exec_path, const char *pattern)
printf("----------------------------");
mput_char('-', strlen(exec_path));
putchar('\n');
- pretty_print_string_list(cmdname, longest - 4);
+ pretty_print_string_list(cmdname, longest);
putchar('\n');
}
@@ -210,7 +211,7 @@ int cmd_help(int argc, const char **argv, const char *prefix)
else if (!strcmp(help_cmd, "--all") || !strcmp(help_cmd, "-a")) {
printf("usage: %s\n\n", git_usage_string);
if(exec_path)
- list_commands(exec_path, "git-*");
+ list_commands(exec_path, "git-");
exit(0);
}
--
1.5.3.4.209.g5d1ce-dirty
^ permalink raw reply related
* Re: Howto request: going home in the middle of something?
From: Jan Wielemaker @ 2007-10-22 8:44 UTC (permalink / raw)
To: Petr Baudis; +Cc: git
In-Reply-To: <20071018112758.GN18279@machine.or.cz>
Thanks for the replies. I think I can live with something like this
<work, in the middle of something>
$ git checkout -b home
$ git commit
$ git checkout master
<arriving at home>
$ git jan@work:repo fetch home:home (using ssh)
$ git checkout home
<continue editing>
$ git commit --amend
$ git checkout master
$ git merge home
$ git -d home
$ git commit
$ git push
<arriving at work>
$ git -d home
$ git pull
Its still a bit many commands and you have to be aware what you are
doing for quite a while, but it does provide one single clean commit
message, doesn't change the shared repo until all is finished and allows
to abandon all work without leaving traces.
Personally I'd be more happy with
<work, in the middle of something>
$ git stash
<arriving at home>
$ git stash fetch jan@work{0} (well, some sensible syntax)
$ git stash apply
<continue editing>
$ git commit
$ git push
<arriving at work>
$ git pull
Its not only shorter, but reduces the risc to make mistakes. I think the
missing fetch to copy the stashed data from work to home is actually
there if you know a bit more about git internals. Right? Ideally, this
could be combined in a little command that will simply move the
uncommitted work from one clone to another, provided you have ssh access
to the machine from which you want to fetch the work.
--- Jan
On Thursday 18 October 2007 13:27, Petr Baudis wrote:
> On Thu, Oct 18, 2007 at 11:44:22AM +0200, Jan Wielemaker wrote:
> > I've somewhere seen it in a mail, but I can't find it anymore. I have a
> > bare central (public) repository and clones on various machines I work
> > on. We all know it, you're right in the middle of something and it is
> > really time to go home. You want to pick up your work at home, but
> > without pushing to the shared repository.
> >
> > I'm sure GIT can do this elegantly, but I'm not yet sure how. I guess
> > Ideally I want "git stash" at work, transfer the stashed changes to my
> > other machine and apply them. How do I do that?
> >
> > Alternatively, I guess, one can commit at machine A, fetch the commit
> > from machine A and continue. I'm still too uncertain about the remote
> > access options to work this out properly, but it also feels less
> > clean.
>
> this should be pretty simple assuming SSH access to machine A. Git can
> fetch over SSH, so it's merely about telling it that repository X is
> available over ssh over there and it'll fetch it home.
>
> The exact setup depends on whether you want to do this just once or
> semi-regularily. If the former, just
>
> git pull git+ssh://a.machine.aero/absolute/path
>
> Note that this should fetch only the remote master branch, if I'm not
> mistaken.
>
> If the latter, tell your home repository about your work repository:
>
> git remote add workrepo git+ssh://a.machine.aero/absolute/path
>
> Then, you can anytime just
>
> git fetch workrepo
>
> and it will fetch all the branches from workrepo; whether you want to
> use git fetch and git merge or git pull depends on your local
> arrangement of branches at home.
>
>
> So, basically, when fetching you deal with your work repository
> exactly the same way as in the shared repository.
>
> When pushing, this is not so trivial. Git _allows_ you to just push to
> your work repository, but if you push to a branch that is currently
> checked out, unexpected things will happen - always avoid that. If you
> can fetch from home at work, do. If not, at least push to a branch at
> work that can never be checked out and is reserved for that purpose.
^ permalink raw reply
* [PATCH] "git" calls help_unknown_cmd(""); "git help" and "git help -a" return 0
From: Scott R Parish @ 2007-10-22 8:32 UTC (permalink / raw)
To: git
Signed-off-by: Scott R Parish <srp@srparish.net>
---
git.c | 5 ++---
help.c | 4 ++--
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/git.c b/git.c
index 853e66c..e1c99e3 100644
--- a/git.c
+++ b/git.c
@@ -445,9 +445,8 @@ int main(int argc, const char **argv)
if (!prefixcmp(argv[0], "--"))
argv[0] += 2;
} else {
- /* Default command: "help" */
- argv[0] = "help";
- argc = 1;
+ /* The user didn't specify a command; give them help */
+ help_unknown_cmd("");
}
cmd = argv[0];
diff --git a/help.c b/help.c
index 1cd33ec..b0d2dd4 100644
--- a/help.c
+++ b/help.c
@@ -204,14 +204,14 @@ int cmd_help(int argc, const char **argv, const char *prefix)
if (!help_cmd) {
printf("usage: %s\n\n", git_usage_string);
list_common_cmds_help();
- exit(1);
+ exit(0);
}
else if (!strcmp(help_cmd, "--all") || !strcmp(help_cmd, "-a")) {
printf("usage: %s\n\n", git_usage_string);
if(exec_path)
list_commands(exec_path, "git-*");
- exit(1);
+ exit(0);
}
else
--
1.5.3.4.209.g5d1ce-dirty
^ permalink raw reply related
* [PATCH] Add some fancy colors in the test library when terminal supports it.
From: Pierre Habouzit @ 2007-10-22 8:13 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git
Signed-off-by: Pierre Habouzit <madcoder@debian.org>
---
Maybe this is just me, but I don't find the output of the test-suite
easy to watch while scrolling. This puts some colors in proper places.
* end-test summaries are in green or red depending on the sucess of
the tests.
* errors are in red.
* skipped tests and other things that tests `say` are in brown (now
you can _see_ that your testsuite skips some tests on purpose, I
only noticed recently that I missed part of the environment for
proper testing).
I'm not 100% sure the test to see if terminal supports color is correct, and
people using emacs shell buffer or alike tools may have better ideas on how to
make it.
and yes, I know that it "depends" upon tput, but if tput isn't available, the
[ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
expression will fail, and color will be disabled.
t/test-lib.sh | 32 ++++++++++++++++++++++----------
1 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/t/test-lib.sh b/t/test-lib.sh
index cc1253c..c6521c0 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -59,14 +59,24 @@ esac
# '
# . ./test-lib.sh
+[ "x$TERM" != "xdumb" ] && tput hpa 60 >/dev/null 2>&1 && tput setaf 1 >/dev/null 2>&1
+nocolor=$?
+
+say_color () {
+ [ "$nocolor" = 0 ] && [ "$1" != '-1' ] && tput setaf "$1"
+ shift
+ echo "* $*"
+ tput op
+}
+
error () {
- echo "* error: $*"
+ say_color 9 "* error: $*"
trap - exit
exit 1
}
say () {
- echo "* $*"
+ say_color 3 "$*"
}
test "${test_description}" != "" ||
@@ -84,6 +94,8 @@ do
exit 0 ;;
-v|--v|--ve|--ver|--verb|--verbo|--verbos|--verbose)
verbose=t; shift ;;
+ --no-color)
+ nocolor=1; shift ;;
--no-python)
# noop now...
shift ;;
@@ -122,13 +134,13 @@ test_tick () {
test_ok_ () {
test_count=$(expr "$test_count" + 1)
- say " ok $test_count: $@"
+ say_color -1 " ok $test_count: $@"
}
test_failure_ () {
test_count=$(expr "$test_count" + 1)
test_failure=$(expr "$test_failure" + 1);
- say "FAIL $test_count: $1"
+ say_color 9 "FAIL $test_count: $1"
shift
echo "$@" | sed -e 's/^/ /'
test "$immediate" = "" || { trap - exit; exit 1; }
@@ -158,9 +170,9 @@ test_skip () {
done
case "$to_skip" in
t)
- say >&3 "skipping test: $@"
+ say_color 10 >&3 "skipping test: $@"
test_count=$(expr "$test_count" + 1)
- say "skip $test_count: $1"
+ say_color 10 "skip $test_count: $1"
: true
;;
*)
@@ -247,11 +259,11 @@ test_done () {
# The Makefile provided will clean this test area so
# we will leave things as they are.
- say "passed all $test_count test(s)"
+ say_color 2 "passed all $test_count test(s)"
exit 0 ;;
*)
- say "failed $test_failure among $test_count test(s)"
+ say_color 9 "failed $test_failure among $test_count test(s)"
exit 1 ;;
esac
@@ -296,8 +308,8 @@ do
done
case "$to_skip" in
t)
- say >&3 "skipping test $this_test altogether"
- say "skip all tests in $this_test"
+ say_color 10 >&3 "skipping test $this_test altogether"
+ say_color 10 "skip all tests in $this_test"
test_done
esac
done
--
1.5.3.4.1342.ge0cd-dirty
^ permalink raw reply related
* Re: Git User's Survey 2007 unfinished summary continued
From: Andreas Ericsson @ 2007-10-22 7:59 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Jakub Narebski, Steffen Prohaska, Federico Mena Quintero, git
In-Reply-To: <Pine.LNX.4.64.0710212308540.25221@racer.site>
Johannes Schindelin wrote:
> Hi,
>
> On Sun, 21 Oct 2007, Andreas Ericsson wrote:
>
>> Johannes Schindelin wrote:
>>
>>> On Sun, 21 Oct 2007, Jakub Narebski wrote:
>>>
>>>> On 10/20/07, Steffen Prohaska <prohaska@zib.de> wrote:
>>>>
>>>>> Maybe we could group commands into more categories?
>>>>>
>>>>> plumbing: should be hidden from the 'normal' user. Porcelain
>>>>> should be sufficient for every standard task.
>>>> The problem is division between what is porcelain and what is
>>>> plumbing. Some commands are right on border (git-fsck,
>>>> git-update-index, git-rev-parse comes to mind).
>>> Sorry, but my impression from the latest mails was that the commands
>>> are fine. What is lacking is a nice, _small_ collection of
>>> recommended workflows. And when we have agreed on such a set of
>>> workflows, we optimize the hell out of them. Only this time it is not
>>> performance, but user-friendliness.
>> http://www.kernel.org/pub/software/scm/git/docs/everyday.html would be a
>> good starting point, I think.
>
> I don't think so. Way too few authors were involved in writing this
> document, so it is not "typical" in and of itself.
>
> I'd really like people to respond not so much with broad and general
> statements to my mail (those statements tend to be rather useless to find
> how to make git more suitable to newbies), but rather with concrete top
> ten lists of what they do daily.
>
> My top ten list:
>
> - git diff
> - git commit
> - git status
> - git fetch
> - git rebase
> - git pull
> - git cherry-pick
> - git bisect
> - git push
> - git add
>
> So again, I'd like people who did _not_ tweak git to their likings to tell
> the most common steps they do. My hope is that we see things that are
> good practices, but could use an easier user interface.
>
I'm not so sure we'd want to hide commands that git-gurus simply do not
use, such as git-blame. In my opinion, we should just locate the highest
level available of UI tool that implements a particular feature and have
that listed in the git[- ]<tab> view.
I doubt many people on this list regularly use git-blame but it's a
command that's definitely non-trivial to script out using only the
"proper" commands, and CVS/SVN users expect it to be there, so it's
probably worth listing anyhow.
Similarly, it might be helpful to have help topics the gdb way, like
"git help patches". It's one of those things that people have come to
expect from a software tool, so perhaps we should humor them? Given gits
"every help topic is a man-page" idiom, this shouldn't require any real
technical effort.
Such topics should probably include
merge/merges/merging - overview of various ways of putting two lines of
development back together
patch/patches - how to create, send and apply
tags/branches/refs - what they are, why they're good, link to merging
I'm currently swanked with day-job work, but I'll see if I can get some
prototype docs done later this week and check if it requires any code
support. If anyone's well-versed in asciidoc HTML-indexing and wants to
provide pointers to what I should think about for generating those topic
menus as html docs, feel free to chuck me an email.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
^ permalink raw reply
* [PATCH] don't set-group-id on directories on apple
From: Scott R Parish @ 2007-10-22 7:55 UTC (permalink / raw)
To: git
"git init --shared=all" was failing because chmod was returning
EPERM. According to the man page, the set-group-id behavior is
already default: man 2 mkdir:
The directory's group ID is set to that of the parent directory
in which it is created.
Signed-off-by: Scott R Parish <srp@srparish.net>
---
path.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/path.c b/path.c
index 4260952..4089753 100644
--- a/path.c
+++ b/path.c
@@ -282,8 +282,10 @@ int adjust_shared_perm(const char *path)
: (shared_repository == PERM_EVERYBODY
? (S_IXGRP|S_IXOTH)
: 0));
+#if !defined(__APPLE__)
if (S_ISDIR(mode))
mode |= S_ISGID;
+#endif
if ((mode & st.st_mode) != mode && chmod(path, mode) < 0)
return -2;
return 0;
--
1.5.3.4.209.g5d1ce-dirty
^ permalink raw reply related
* [PATCH] Let users override name of per-directory ignore file
From: Andreas Ericsson @ 2007-10-15 12:09 UTC (permalink / raw)
To: git; +Cc: spearce
When collaborating with projects managed by some other
scm, it often makes sense to have git read that other
scm's ignore-files. This patch lets git do just that, if
the user only tells it the name of the per-directory
ignore file by specifying the newly introduced git config
option 'core.ignorefile'.
Theoretically, this could cause problems when projects get
ported from some other scm to git, but in practice that
is a moot point, as such changes are always followed by a
flagday anyway.
Signed-off-by: Andreas Ericsson <ae@op5.se>
---
Documentation/config.txt | 10 ++++++++++
Documentation/gitignore.txt | 5 +++++
builtin-add.c | 11 +++++++++--
wt-status.c | 9 ++++++++-
4 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index edf50cd..5170e27 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -275,6 +275,16 @@ You probably do not need to adjust this value.
+
Common unit suffixes of 'k', 'm', or 'g' are supported.
+core.ignorefile::
+ Tells git to use a different file for per-directory ignores.
+ This is useful when one wishes to use git for a project
+ when the upstream repository is managed by some other SCM
+ whose ignore-file formats are the same as that of git.
+ For example, setting core.ignorefile to .svnignore in
+ repos where one interacts with the upstream project repo
+ using gitlink:git-svn[1] will make a both SVN users and
+ your own repo ignore the same files.
+
core.excludesfile::
In addition to '.gitignore' (per-directory) and
'.git/info/exclude', git looks into this file for patterns
diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
index e8b8581..f82eac6 100644
--- a/Documentation/gitignore.txt
+++ b/Documentation/gitignore.txt
@@ -32,6 +32,11 @@ precedence, the last matching pattern decides the outcome):
`.gitignore` file. A project normally includes such
`.gitignore` files in its repository, containing patterns for
files generated as part of the project build.
+ The name of the `.gitignore` file can be changed by setting
+ the configuration variable 'core.ignorefile'. This is useful
+ when using git for projects where upstream is using some other
+ SCM. For example, setting 'core.ignorefile' to `.cvsignore`
+ will make git ignore the same files CVS would.
* Patterns read from `$GIT_DIR/info/exclude`.
diff --git a/builtin-add.c b/builtin-add.c
index 966e145..c785d99 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -19,6 +19,7 @@ static const char builtin_add_usage[] =
static int take_worktree_changes;
static const char *excludes_file;
+static const char *ignore_file = ".gitignore";
static void prune_directory(struct dir_struct *dir, const char **pathspec, int prefix)
{
@@ -57,7 +58,7 @@ static void fill_directory(struct dir_struct *dir, const char **pathspec,
memset(dir, 0, sizeof(*dir));
if (!ignored_too) {
dir->collect_ignored = 1;
- dir->exclude_per_dir = ".gitignore";
+ dir->exclude_per_dir = ignore_file;
path = git_path("info/exclude");
if (!access(path, R_OK))
add_excludes_from_file(dir, path);
@@ -144,6 +145,12 @@ static int git_add_config(const char *var, const char *value)
excludes_file = xstrdup(value);
return 0;
}
+ if (!strcmp(var, "core.ignorefile")) {
+ if (!value)
+ die("core.ignorefile without value");
+ ignore_file = xstrdup(value);
+ return 0;
+ }
return git_default_config(var, value);
}
@@ -158,7 +165,7 @@ int interactive_add(void)
static struct lock_file lock_file;
static const char ignore_error[] =
-"The following paths are ignored by one of your .gitignore files:\n";
+"The following paths are ignored:\n";
int cmd_add(int argc, const char **argv, const char *prefix)
{
diff --git a/wt-status.c b/wt-status.c
index 03b5ec4..103aefe 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -23,6 +23,7 @@ static const char use_add_rm_msg[] =
static const char use_add_to_include_msg[] =
"use \"git add <file>...\" to include in what will be committed";
static const char *excludes_file;
+static const char *ignore_file = ".gitignore";
static int parse_status_slot(const char *var, int offset)
{
@@ -257,7 +258,7 @@ static void wt_status_print_untracked(struct wt_status *s)
memset(&dir, 0, sizeof(dir));
- dir.exclude_per_dir = ".gitignore";
+ dir.exclude_per_dir = ignore_file;
if (!s->untracked) {
dir.show_other_directories = 1;
dir.hide_empty_directories = 1;
@@ -370,5 +371,11 @@ int git_status_config(const char *k, const char *v)
excludes_file = xstrdup(v);
return 0;
}
+ if (!strcmp(k, "core.ignorefile")) {
+ if (!v)
+ die("core.ignorefile without value");
+ ignore_file = xstrdup(v);
+ return 0;
+ }
return git_default_config(k, v);
}
--
1.5.3.4.1273.g725c19
^ permalink raw reply related
* [PATCH] git-format-patch: Don't number patches when there's only one
From: Andreas Ericsson @ 2007-10-21 8:13 UTC (permalink / raw)
To: git; +Cc: spearce
[PATCH 1/1] looks a bit silly, and automagically handling
this in git-format-patch makes some scripting around it a
lot more pleasant.
Signed-off-by: Andreas Ericsson <ae@op5.se>
---
Documentation/git-format-patch.txt | 3 ++-
builtin-log.c | 4 ++++
2 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/Documentation/git-format-patch.txt b/Documentation/git-format-patch.txt
index c9857a2..9f16951 100644
--- a/Documentation/git-format-patch.txt
+++ b/Documentation/git-format-patch.txt
@@ -56,7 +56,8 @@ If -o is specified, output files are created in <dir>. Otherwise
they are created in the current working directory.
If -n is specified, instead of "[PATCH] Subject", the first line
-is formatted as "[PATCH n/m] Subject".
+is formatted as "[PATCH n/m] Subject" if the revision range to
+format contains more than one commit.
If given --thread, git-format-patch will generate In-Reply-To and
References headers to make the second and subsequent patch mails appear
diff --git a/builtin-log.c b/builtin-log.c
index e8b982d..5c48f4d 100644
--- a/builtin-log.c
+++ b/builtin-log.c
@@ -642,6 +642,10 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
list[nr - 1] = commit;
}
total = nr;
+
+ /* don't number patches when there's only one */
+ if (total == 1)
+ numbered = 0;
if (numbered)
rev.total = total + start_number - 1;
rev.add_signoff = add_signoff;
--
1.5.3.4.1273.g725c19
^ permalink raw reply related
* Re: What's cooking in git/spearce.git (topics)
From: Pierre Habouzit @ 2007-10-22 7:24 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git
In-Reply-To: <20071022063222.GS14735@spearce.org>
[-- Attachment #1: Type: text/plain, Size: 2790 bytes --]
On Mon, Oct 22, 2007 at 06:32:22AM +0000, Shawn O. Pearce wrote:
> * ph/parseopt (Mon Oct 15 23:06:02 2007 +0200) 22 commits
> - Make builtin-pack-refs.c use parse_options.
> - Make builtin-name-rev.c use parse_options.
> - Make builtin-count-objects.c use parse_options.
> - Make builtin-fsck.c use parse_options.
> - Update manpages to reflect new short and long option aliases
> - Make builtin-for-each-ref.c use parse-opts.
> - Make builtin-symbolic-ref.c use parse_options.
> - Make builtin-update-ref.c use parse_options
> - Make builtin-revert.c use parse_options.
> - Make builtin-describe.c use parse_options
> - Make builtin-branch.c use parse_options.
> - Make builtin-mv.c use parse-options
> - Make builtin-rm.c use parse_options.
> - Port builtin-add.c to use the new option parser.
> - parse-options: allow callbacks to take no arguments at all.
> - parse-options: Allow abbreviated options when unambiguous
> - Add shortcuts for very often used options.
> - parse-options: make some arguments optional, add callbacks.
> - Rework make_usage to print the usage message immediately
> - Add tests for parse-options.c
> - parse-options: be able to generate usages automatically
> - Add a simple option parser.
>
> There's actually a few other commits (3?) missing from the above
> list that are safely parked away in my tree. I'm most of the way
> through reviewing these and have made a few bug fixes and style
> nit corrections in the earlier parts of the series. I'm going to
> try and finish working through this series tomorrow and then will
> probably merge it to next.
> The other 3 commits are hanging off to the side as 2 of them are
> for the top of db/fetch-pack (to port builtin-fetch and friends
> to the new option parser) and the 3rd is a somewhat questionable
> change due to needing to rename a "-h" to a "-H". I just got
> tired of rebasing these 3 other commits onto the respective topics.
> I'm waiting for the core option parser (above) to freeze on a commit
> SHA-1 before I deal with these again.
You also can ask me to resubmit them when the first round of parseopt
is merged. I'm actually waiting for that before I work on it again
(that's not the sole reason, $work ate my time as well, so I'm not
pressuring ;P) and I can send those again when things are more ready for
them.
Like I said to you on IRC, I intend to hack a way to ask parse-options
to dump its options in a machine parseable way, to help {z,ba}sh
completions authors. And obviously I intend to migrate even more
builtins :)
--
·O· Pierre Habouzit
··O madcoder@debian.org
OOO http://www.madism.org
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply
* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
From: Jeff King @ 2007-10-22 7:21 UTC (permalink / raw)
To: skimo
Cc: Linus Torvalds, Git Mailing List, Junio C Hamano, Shawn O. Pearce,
David Kastrup
In-Reply-To: <20071022070750.GM1179MdfPADPa@greensroom.kotnet.org>
On Mon, Oct 22, 2007 at 09:07:50AM +0200, Sven Verdoolaege wrote:
> Aren't you truncating the ptr list after the first entry here?
> (While you still need the whole list in free_file_table.)
Yes, good eyes. And because we actually reverse the list, it's not as
simple as just sticking the two broken up pieces together again; the
original head must end up as the head of the list after they are glued
together again, but it is actually the tail of one of the lists.
-Peff
^ permalink raw reply
* Re: What's cooking in git/spearce.git (topics)
From: Jeff King @ 2007-10-22 7:16 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Linus Torvalds, git
In-Reply-To: <20071022063222.GS14735@spearce.org>
On Mon, Oct 22, 2007 at 02:32:22AM -0400, Shawn O. Pearce wrote:
> * lt/rename (Sun Oct 21 16:59:03 2007 -0700) 2 commits
> - Linear-time/space rename logic (exact renames only)
> - Split out "exact content match" phase of rename detection
>
> t4001-diff-rename.sh failed while running tests on pu. I'm pretty
> sure its this topic from Linus. Need to look at it further.
Hrm, the problem is that it's not favoring basenames anymore. But I
think it is because the loop in find_identical_files is inside out. For
every source file, it picks the best destination file. But I think we
want to do the opposite: for every destination file, pick the best
source file. Otherwise, if a non-basename source file is connected with
a particular destination file, we no longer try that destination file
again.
Patch is below (please just squash with the one from Linus).
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 05d39db..8881818 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -252,17 +252,18 @@ static int find_identical_files(struct file_similarity *src,
{
int renames = 0;
do {
- struct diff_filespec *one = src->filespec;
+ struct diff_filespec *one = dst->filespec;
struct file_similarity *p, *best;
int i = 100;
best = NULL;
- for (p = dst; p; p = p->next) {
+ for (p = src; p; p = p->next) {
struct diff_filespec *two = p->filespec;
- /* Already picked as a destination? */
+ /* Already picked as a source? */
if (!p->src_dst)
continue;
+
/* False hash collission? */
if (hashcmp(one->sha1, two->sha1))
continue;
@@ -276,10 +277,10 @@ static int find_identical_files(struct file_similarity *src,
}
if (best) {
best->src_dst = 0;
- record_rename_pair(best->index, src->index, MAX_SCORE);
+ record_rename_pair(dst->index, best->index, MAX_SCORE);
renames++;
}
- } while ((src = src->next) != NULL);
+ } while ((dst = dst->next) != NULL);
return renames;
}
^ permalink raw reply related
* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
From: Sven Verdoolaege @ 2007-10-22 7:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
Jeff King
In-Reply-To: <alpine.LFD.0.999.0710211603200.10525@woody.linux-foundation.org>
On Sun, Oct 21, 2007 at 04:59:03PM -0700, Linus Torvalds wrote:
> +static int find_same_files(void *ptr)
> +{
> + struct file_similarity *p = ptr;
> + struct file_similarity *src = NULL, *dst = NULL;
> +
> + /* Split the hash list up into sources and destinations */
> + do {
> + struct file_similarity *entry = p;
> + p = p->next;
> + if (entry->src_dst < 0) {
> + entry->next = src;
> + src = entry;
> + } else {
> + entry->next = dst;
> + dst = entry;
> + }
> + } while (p);
Aren't you truncating the ptr list after the first entry here?
(While you still need the whole list in free_file_table.)
skimo
^ permalink raw reply
* Re: What's cooking in git/spearce.git (topics)
From: Jeff King @ 2007-10-22 6:59 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git
In-Reply-To: <20071022063222.GS14735@spearce.org>
On Mon, Oct 22, 2007 at 02:32:22AM -0400, Shawn O. Pearce wrote:
> * jk/terse-fetch (Fri Oct 19 03:40:57 2007 -0400) 62 commits
> - park
> - git-fetch: mega-terse fetch output
> ... db/fetch-pack begins here ...
>
> Much discussion on the list about this topic. I think we may have
> come to an agreement about what the output should look like, but
> this topic doesn't implement that output format. Someone needs to
> either update this topic or rewrite it. Starting from Jeff King's
> patch makes things a lot easier.
I will eventually get around to rewriting this (it seems the list
comments have died down), but it is much more interesting to play with
Linus' rename stuff tonight. :) If somebody else wants to take a stab,
please go ahead.
-Peff
^ permalink raw reply
* Re: [PATCH] Be nice with compilers that do not support runtime paths at all.
From: Brian Dessent @ 2007-10-22 6:52 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Benoit SIGOURE, git list
In-Reply-To: <20071022064454.GV14735@spearce.org>
"Shawn O. Pearce" wrote:
> How old of a GNU make are talking about here? The above is certainly
According to the NEWS file, $(call) was added to GNU make v3.78,
released 1999-09-22.
Brian
^ permalink raw reply
* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
From: Jeff King @ 2007-10-22 6:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup
In-Reply-To: <alpine.LFD.0.999.0710211603200.10525@woody.linux-foundation.org>
On Sun, Oct 21, 2007 at 04:59:03PM -0700, Linus Torvalds wrote:
> What it does is to rather than iterate over all sources and destinations
> and checking if they are identical (which is O(src*dst)), it hashes each
> of the sources and destinations into a hash table, using the SHA1 hash of
> the contents as the hash. That's O(n+m). It then walks the hash table
> (which is also O(m+n) in size), and only pairs up files for comparison
> that hashed to the same spot.
>
> Doing this for more than just the exact same contents would be basically
> the same thing, except it starts hashing up fingerprints of the contents
> and linking up file pairs that get linked up by those fingerprints. More
> involved, but not impossible.
Hrm. For the inexact case, it seems like you should be able to do
something like this:
1. put the content fingerprint hashes in a hash table
2. create a single src * dst table of scores
3. walk the hash table; for each bucket in which there are collisions,
find every pair in that bucket, and add to the similarity score in
the big score table
4. for each src in the similarity table, find the maximum dest
Step 1 is clearly O(n+m). Step 2 allocates O(n*m) memory, but we only
need a single integer in each slot. The outer walk in step 3 is
O(fingerprints), which is, O(content_size * (n+m)). The inner loop can
actually be worst case O(n*m), but is more likely to be a handful of
pairs (assuming that for any fingerprint chunk, it is only going to
exist in a few files; you would get worst case if you were comparing a
bunch of files with identical contents). And step 4 is actually going to
end up being O(n*m), since you have to find the best match for each.
So there is actually still O(n*m) behavior, but I wonder if we are
tightening the O(n*m) loop enough that we will see improvement.
Hrm. I wonder if we could keep a "best" pointer for each src file, and
when we update the score for a given src/dest pair, check for a new
best. That would add more to step 3, but make step 4 O(n).
And of course the n*m memory is going to bottleneck. I think for 1000 *
1000, it should be fine, but if you want to try 100,000 * 100,000, you
are going to need tens of gigabytes. And it's going to be very sparse.
So perhaps a hash table with keys of src+dst mapped to scores. And then
we could sort the whole table afterwards, and just run through it
linearly, which helps step 4.
> - it looks ok, and I've tested it some, but this needs more people
> looking at it.
Overall it looks like a sane approach. My comments are below.
> - in fact, the big optimization isn't the actual hash table, but the
> independent and much simpler "diff_filespec->used" optimization for a
> deleted filename that was used for a rename/copy.
Do you have separate timings? The "diff_filespec->used" optimization
appears to be cutting out an O(n*m) strcmp. If it is most of the
optimization, I wonder if the complexity of the hash change is worth it
(although I find the new code easier to read, so maybe it is worth it on
those grounds alone).
> +struct file_similarity {
> + int src_dst, index;
It took me a while to figure out all of the meanings of src_dst; maybe a
comment is in order?
> +static int find_identical_files(struct file_similarity *src,
> + struct file_similarity *dst)
Your function naming is a bit confusing. You have find_identical_files,
find_same_files, and find_exact_renames, all of which do the same thing,
but for different levels of input. Perhaps the names should reflect how
they are different?
> +static int find_identical_files(struct file_similarity *src,
> + struct file_similarity *dst)
> +{
> + int renames = 0;
> + do {
> + struct diff_filespec *one = src->filespec;
> + struct file_similarity *p, *best;
> + int i = 100;
> +
> + best = NULL;
> + for (p = dst; p; p = p->next) {
> + struct diff_filespec *two = p->filespec;
> +
> + /* Already picked as a destination? */
> + if (!p->src_dst)
> + continue;
> + /* False hash collission? */
> + if (hashcmp(one->sha1, two->sha1))
> + continue;
> + best = p;
> + if (basename_same(one, two))
> + break;
> +
> + /* Too many identical alternatives? Pick one */
> + if (!--i)
> + break;
> + }
> + if (best) {
> + best->src_dst = 0;
> + record_rename_pair(best->index, src->index, MAX_SCORE);
> + renames++;
> + }
> + } while ((src = src->next) != NULL);
> + return renames;
> +}
The "too many identical alternatives" sanity check is interesting. It
can produce suboptimal results if, e.g., the 101st entry has the same
basename. I suppose it is necessary to prevent a worst case "oops, all
of these files are identical" O(n*m) behavior. But generally I would
favor "always correct, pathological cases slow" to "always fast,
pathological cases incorrect". But maybe the basename heuristic isn't
worth considering "correct" in this case (and honestly, I doubt anyone
will hit the 100 limit anyway).
> +/*
> + * Note: the rest of the rename logic depends on this
> + * phase also populating all the filespecs for any
> + * entry that isn't matched up with an exact rename.
> + */
> +static int free_file_table(void *ptr)
> +{
> + struct file_similarity *p = ptr;
> + do {
> + struct file_similarity *entry = p;
> + p = p->next;
> +
> + /* Stupid special case, see note above! */
> + diff_populate_filespec(entry->filespec, 0);
> + free(entry);
> + } while (p);
> + return 0;
> +}
diff_populate_filespec knows whether or not it needs to do any work. I
wonder if those parts of the rename process that need it should just
call it unconditionally. It seems like a fragile dependency. Besides
which...
> +static unsigned int hash_filespec(struct diff_filespec *filespec)
> +{
> + unsigned int hash;
> + if (!filespec->sha1_valid) {
> + if (diff_populate_filespec(filespec, 0))
> + return 0;
> + hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
> + }
> + memcpy(&hash, filespec->sha1, sizeof(hash));
> + return hash;
> +}
...if everything has been through this hash function, why do we need to
populate again upon freeing?
> +static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
> +{
> + unsigned int size = table->size, nr = hash % size;
> + struct hash_table_entry *array = table->array;
> +
> + while (array[nr].ptr) {
> + if (array[nr].hash == hash)
> + break;
> + nr++;
> + if (nr >= size)
> + nr = 0;
> + }
> + return array + nr;
> +}
Rather than having buckets where collisions extend in a list within the
bucket, it looks like you are just overflowing into the next bucket.
It seems like you could end up with pretty bad "runs" of full buckets.
E.g., bucket 1 has a collision, so it bleeds into bucket 2. Now when we
place something in bucket 2, it has to skip past bucket 1's overflow.
And a collision in bucket 2 means we have to skip the overflow for
buckets 1 _and_ 2. And so on, until we can resync by finding enough free
buckets to accept all of our collisions. So it depends on keeping a lot
of holes in the hash structure, which I see that you do, but I wonder
what the optimal value is.
I assume you chose this method to reduce memory fragmentation, which
makes sense. And maybe there is a simple answer that "50-75% free gives
good results over evenly distributed hashes." Though perhaps we should
also consider clustered hashes (especially if we want to put the
fingerprint hashes directly in here).
> +/*
> + * Insert a new hash entry pointer into the table.
> + *
> + * If that hash entry already existed, return the pointer to
> + * the existing entry (and the caller can create a list of the
> + * pointers or do anything else). If it didn't exist, return
> + * NULL (and the caller knows the pointer has been inserted).
> + */
> +static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
This calling convention seems a bit clunky. Since all callers have to
check for hash collision anyway, why not just unconditionally return the
pointer to the data ptr (which will itself be NULL if this is the first
entry)? Then you can drop the conditional, and:
pos = insert_hash(hash, entry, table);
if (pos) {
entry->next = *pos;
*pos = entry;
}
can simply become:
pos = insert_hash(hash, table);
entry->next = *pos;
*pos = entry;
> +void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
> +{
> + unsigned int nr = table->nr;
> + if (nr >= table->size/2)
> + grow_hash_table(table);
> + return insert_hash_entry(hash, ptr, table);
> +}
Style nitpick: the local "nr" is rather pointless.
-Peff
^ permalink raw reply
* Re: [PATCH] Be nice with compilers that do not support runtime paths at all.
From: Shawn O. Pearce @ 2007-10-22 6:44 UTC (permalink / raw)
To: Benoit SIGOURE; +Cc: git list
In-Reply-To: <09169ECD-19E1-44D1-8539-71EBBA3826A8@lrde.epita.fr>
Benoit SIGOURE <tsuna@lrde.epita.fr> wrote:
> >On Oct 4, 2007, at 1:18 AM, Junio C Hamano wrote:
> >>Benoit Sigoure <tsuna@lrde.epita.fr> writes:
> >>
> >>If we do not care about supporting too old GNU make, we can do
> >>this by first adding this near the top:
> >>
> >> ifndef NO_RPATH
> >> LINKER_PATH = -L$(1) $(CC_LD_DYNPATH)$(1)
> >> else
> >> LINKER_PATH = -L$(1)
> >> endif
> >>
> >>and then doing something like:
> >>
> >> CURL_LIBCURL = $(call LINKER_PATH,$(CURLDIR)/$(lib))
> >> OPENSSL_LINK = $(call LINKER_PATH,$(OPENSSLDIR)/$(lib))
> >>
> >>to make it easier to read and less error prone.
> >
> >Yes. I can rework the patch, but the question is: do you care
> >about old GNU make? Can I rewrite the patch with this feature?
>
> I know Junio is still offline but maybe someone else has an objection
> against this?
How old of a GNU make are talking about here? The above is certainly
a lot nicer to read, but I'd hate to suddenly ship a new Git that
someone cannot compile because their GNU make is too old.
GNU make is fortunately pretty easy to compile, so it shouldn't be
that difficult for someone to build a newer version if they had to,
but why make them go through all that extra work just to install
a new Git?
What about using a small helper shell script and using $(shell)
instead of $(call)?
So I guess in short I think I was in agreement with Junio a while
ago on this, which was that I don't want to require a newer GNU
make than we already require our users to have.
--
Shawn.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox