* Rename similarity scoring for file pair
@ 2010-04-12 18:12 Debayan Banerjee
2010-04-12 18:14 ` Debayan Banerjee
2010-04-17 11:03 ` Jeff King
0 siblings, 2 replies; 3+ messages in thread
From: Debayan Banerjee @ 2010-04-12 18:12 UTC (permalink / raw)
To: git
I was going through this thread
<http://markmail.org/message/mge2spy7uqle5ghy#query:diffcore-rename.c%20algorithm+page:1+mid:ji7jkzypzqpml2xn+state:results>
trying to understand how the similarity scoring for modified files
works. I see that the algorithm uses two hashes, one for each file,
and proceeds to compare the 2 hashes to determine percentage
similarity.
I think I have an algorithm which uses only one hash and has a worst
case space complexity of O(N) where N is the number of lines
inserted/deleted/moved.
The algorithm is as follows:
def similarity(leftFileContent, rightFileContent)
{
int lineCounter = 0;
struct HashElement {
HashElement(): leftFileCount(0), rightFileCount(0) // One of these
counters will remain 0 for its lifetime. The other will be >=0
int leftFileCount; // Increment if line is from leftFile
int rightFileCount; // Increment if line is from right file.
};
Hash<lineContent, HashElement> hash; //One hash for both files
int similartyCounter = 0;
while(true) {
leftLline = leftFileContent[lineCounter];
rightLine = rightFileContent[lineCounter];
if (hash.contains(leftLine)) {
HashElement h = hash.value(leftLine);
if (h.leftFileCount > 0) {//The previous occurence has come from
the left file
h.leftFileCount++; //This line is another occurence of the same
line in the same file. So we increment the counter.
}
else {
h.leftFileCount--;
similarityCounter++; //Hey, this line matched an earlier
occurence from the other file!
}
if (h.leftFileCount == 0 && h.rightFileCount == 0)
hash.remove(leftLine);
}
else { // this line is being encountered for the first time, so add
it to the hash
HashElement h;
h.leftFileCount++;
hash.add(leftLine, h);
}
if (hash.contains(rightLine)) {
HashElement h = hash.value(rightLine);
if (h.rightFileCount > 0) {//The previous occurence has come from
the right file
h.rightFileCount++; //This line is another occurence of the same
line in the same file. So we increment the counter.
}
else {
h.rightFileCount--;
similarityCounter++; //Hey, this line matched an earlier
occurence from the other file!
}
if (h.leftFileCount == 0 && h.rightFileCount == 0)
hash.remove(line2);
}
else { // this line is being encountered for the first time, so add
it to the hash
HashElement h;
h.rightFileCount++;
hash.add(line2, h);
}
lineCounter++;
if (lineCounter == leftFileContent.size() || lineCounter ==
rightFileContent.size())
break;
}
/* Upto this point we have processed an equal number of lines in both
the files simultaneously. Now we may have a file which is
larger in size than the other. We need to process the remanant of this file*/
bool leftFileIsLonger;
FileContent fileContent;
int size;
if (leftFileContent.size() == rightFileContent.size())
return ((similarityCounter*100) / rightFileContent.size());
if (leftFileContent.size() >rightFileContent.size()) {
leftFileIsLonger = true;
fileContent = leftFileContent; //fileContent should really be
a pointer to leftFileContent
size = leftFileContent.size();
}
else {
leftFileIsLonger = false;
fileContent = rightFileContent; //fileContent should really be a
pointer to rightFileContent
size = rightFileContent.size();
}
for (i = lineCounter + 1; i < size; i++ ) {
line = fileContent[i];
if (hash.contains(line)) {
Hash h = hash.value(line);
if (h.leftLineCount > 0) { //Line in hash came from left file
if (leftFileIsLonger)
h.leftFileCount++; //Even this line came from the same file, so
its just a repeated line.
else {
h.leftFileCount--;
similarityCounter++; //This line came from the right file, so
increase similarityCounter and decrease the leftCounter.
}
}
else { //Line in hash came from right file
if (leftFileIsLonger) {
h.rightFileCount--;
similarityCounter++;
}
else
h.rightFileCount++;
}
if (h.leftFileCount == 0 && h.rightFileCount == 0)
hash.remove(line);
}
}
return ((similarityCounter*100) / size); //Formula for calculating
similarity is same as git uses
}
--
Debayan Banerjee
Software Engineer
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Rename similarity scoring for file pair
2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
@ 2010-04-12 18:14 ` Debayan Banerjee
2010-04-17 11:03 ` Jeff King
1 sibling, 0 replies; 3+ messages in thread
From: Debayan Banerjee @ 2010-04-12 18:14 UTC (permalink / raw)
To: git
[-- Attachment #1: Type: text/plain, Size: 242 bytes --]
On 12 April 2010 23:42, Debayan Banerjee <debayanin@gmail.com> wrote:
> I was going through this thread
Sorry about that. Sent it prematurely, and by mistake. Find the
attached file for the algorithm.
--
Debayan Banerjee
Software Engineer
[-- Attachment #2: similarity_algorithm.txt --]
[-- Type: text/plain, Size: 3642 bytes --]
def similarity(leftFileContent, rightFileContent)
{
int lineCounter = 0;
struct HashElement {
HashElement(): leftFileCount(0), rightFileCount(0) // One of these counters will remain 0 for its lifetime. The other will be >=0
int leftFileCount; // Increment if line is from leftFile
int rightFileCount; // Increment if line is from right file.
};
Hash<lineContent, HashElement> hash; //One hash for both files
int similartyCounter = 0;
while(true) {
leftLline = leftFileContent[lineCounter];
rightLine = rightFileContent[lineCounter];
if (hash.contains(leftLine)) {
HashElement h = hash.value(leftLine);
if (h.leftFileCount > 0) {//The previous occurence has come from the left file
h.leftFileCount++; //This line is another occurence of the same line in the same file. So we increment the counter.
}
else {
h.leftFileCount--;
similarityCounter++; //Hey, this line matched an earlier occurence from the other file!
}
if (h.leftFileCount == 0 && h.rightFileCount == 0)
hash.remove(leftLine);
}
else { // this line is being encountered for the first time, so add it to the hash
HashElement h;
h.leftFileCount++;
hash.add(leftLine, h);
}
if (hash.contains(rightLine)) {
HashElement h = hash.value(rightLine);
if (h.rightFileCount > 0) {//The previous occurence has come from the right file
h.rightFileCount++; //This line is another occurence of the same line in the same file. So we increment the counter.
}
else {
h.rightFileCount--;
similarityCounter++; //Hey, this line matched an earlier occurence from the other file!
}
if (h.leftFileCount == 0 && h.rightFileCount == 0)
hash.remove(line2);
}
else { // this line is being encountered for the first time, so add it to the hash
HashElement h;
h.rightFileCount++;
hash.add(line2, h);
}
lineCounter++;
if (lineCounter == leftFileContent.size() || lineCounter == rightFileContent.size())
break;
}
/* Upto this point we have processed an equal number of lines in both the files simultaneously. Now we may have a file which is
larger in size than the other. We need to process the remanant of this file*/
bool leftFileIsLonger;
FileContent fileContent;
int size;
if (leftFileContent.size() == rightFileContent.size())
return ((similarityCounter*100) / rightFileContent.size());
if (leftFileContent.size() >rightFileContent.size()) {
leftFileIsLonger = true;
fileContent = leftFileContent; //fileContent should really be a pointer to leftFileContent
size = leftFileContent.size();
}
else {
leftFileIsLonger = false;
fileContent = rightFileContent; //fileContent should really be a pointer to rightFileContent
size = rightFileContent.size();
}
for (i = lineCounter + 1; i < size; i++ ) {
line = fileContent[i];
if (hash.contains(line)) {
Hash h = hash.value(line);
if (h.leftLineCount > 0) { //Line in hash came from left file
if (leftFileIsLonger)
h.leftFileCount++; //Even this line came from the same file, so its just a repeated line.
else {
h.leftFileCount--;
similarityCounter++; //This line came from the right file, so increase similarityCounter and decrease the leftCounter.
}
}
else { //Line in hash came from right file
if (leftFileIsLonger) {
h.rightFileCount--;
similarityCounter++;
}
else
h.rightFileCount++;
}
if (h.leftFileCount == 0 && h.rightFileCount == 0)
hash.remove(line);
}
}
return ((similarityCounter*100) / size); //Formula for calculating similarity is same as git uses
}
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Rename similarity scoring for file pair
2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
2010-04-12 18:14 ` Debayan Banerjee
@ 2010-04-17 11:03 ` Jeff King
1 sibling, 0 replies; 3+ messages in thread
From: Jeff King @ 2010-04-17 11:03 UTC (permalink / raw)
To: Debayan Banerjee; +Cc: git
On Mon, Apr 12, 2010 at 11:42:59PM +0530, Debayan Banerjee wrote:
> I was going through this thread
> <http://markmail.org/message/mge2spy7uqle5ghy#query:diffcore-rename.c%20algorithm+page:1+mid:ji7jkzypzqpml2xn+state:results>
> trying to understand how the similarity scoring for modified files
> works. I see that the algorithm uses two hashes, one for each file,
> and proceeds to compare the 2 hashes to determine percentage
> similarity.
> I think I have an algorithm which uses only one hash and has a worst
> case space complexity of O(N) where N is the number of lines
> inserted/deleted/moved.
That looks pretty similar to the algorithm that Andy Chup proposed. I
did some work towards integrating it, but never completed it. See these
threads for details:
http://thread.gmane.org/gmane.comp.version-control.git/61975
http://thread.gmane.org/gmane.comp.version-control.git/62655
So yes, it's an interesting way to go. But it needs somebody to:
1. Implement it in C and integrate it into diffcore-rename (I did
that, but there were some problems, see below).
2. Confirm that the quality of results is similar to the current
algorithm. In my case, the results were different, and I suggested
some tweaks to improve them (see the second thread above). I don't
remember if I ever followed up.
3. Confirm that we really do get a speed improvement on the slow
cases, and don't slow down or significantly bloat memory usage for
the fast cases.
If you're interested in working on it, I'd be happy to help with
discussion and code review.
-Peff
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2010-04-17 11:04 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
2010-04-12 18:14 ` Debayan Banerjee
2010-04-17 11:03 ` Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).