Rename similarity scoring for file pair

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Rename similarity scoring for file pair
@ 2010-04-12 18:12 Debayan Banerjee
  2010-04-12 18:14 ` Debayan Banerjee
  2010-04-17 11:03 ` Jeff King
  0 siblings, 2 replies; 3+ messages in thread
From: Debayan Banerjee @ 2010-04-12 18:12 UTC (permalink / raw)
  To: git

I was going through this thread
<http://markmail.org/message/mge2spy7uqle5ghy#query:diffcore-rename.c%20algorithm+page:1+mid:ji7jkzypzqpml2xn+state:results>
trying to understand how the similarity scoring for modified files
works. I see that the algorithm uses two hashes, one for each file,
and proceeds to compare the 2 hashes to determine percentage
similarity.
I think I have an algorithm which uses only one hash and has a worst
case space complexity of O(N) where N is the number of lines
inserted/deleted/moved.
The algorithm is as follows:

def similarity(leftFileContent, rightFileContent)
{
	int lineCounter = 0;
	struct HashElement  {
	HashElement(): leftFileCount(0), rightFileCount(0) // One of these
counters will remain 0 for its lifetime. The other will be >=0
		int leftFileCount; // Increment if line is from leftFile
		int rightFileCount; // Increment if line is from right file.
	};
	Hash<lineContent, HashElement> hash; //One hash for both files
	int similartyCounter = 0;
	while(true)  {
		leftLline = leftFileContent[lineCounter];
		rightLine = rightFileContent[lineCounter];
		if (hash.contains(leftLine))  {
			HashElement h = hash.value(leftLine);
			if (h.leftFileCount > 0)   {//The previous occurence has come from
the left file
				h.leftFileCount++;  //This line is another occurence of the same
line in the same file. So we increment the counter.
			}
			else  {
				h.leftFileCount--;
				similarityCounter++;  //Hey, this line matched an earlier
occurence from the other file!
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
				hash.remove(leftLine);			
		}
		else  { // this line is being encountered for the first time, so add
it to the hash
			HashElement h;
			h.leftFileCount++;
			hash.add(leftLine, h);
		}
		
		if (hash.contains(rightLine))  {
			HashElement h = hash.value(rightLine);
			if (h.rightFileCount > 0)   {//The previous occurence has come from
the right file
				h.rightFileCount++;  //This line is another occurence of the same
line in the same file. So we increment the counter.
			}
			else  {
				h.rightFileCount--;
				similarityCounter++;  //Hey, this line matched an earlier
occurence from the other file!
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
				hash.remove(line2);			
		}
		else  { // this line is being encountered for the first time, so add
it to the hash
			HashElement h;
			h.rightFileCount++;
			hash.add(line2, h);			
		}
		lineCounter++;
		if (lineCounter == leftFileContent.size() || lineCounter ==
rightFileContent.size())
			break;
	}
	/* Upto this point we have processed an equal number of lines in both
the files simultaneously. Now we may have a file which is
	larger in size than the other. We need to process the remanant of this file*/
	bool leftFileIsLonger;
	FileContent fileContent;
	int size;
	if (leftFileContent.size() == rightFileContent.size())		
		return ((similarityCounter*100) / rightFileContent.size());
	if (leftFileContent.size() >rightFileContent.size())  {
		leftFileIsLonger = true;
	        fileContent = leftFileContent; //fileContent should really be
 a pointer to leftFileContent
		size = leftFileContent.size();
        }
	else  {
		leftFileIsLonger = false;
		fileContent = rightFileContent; //fileContent should really be  a
pointer to rightFileContent
		size = rightFileContent.size();
	}
	for (i = lineCounter + 1;  i < size; i++ )  {
		line = fileContent[i];
		if (hash.contains(line))  {
			Hash h = hash.value(line);
			if (h.leftLineCount > 0)  { //Line in hash came from left file
				if (leftFileIsLonger)
					h.leftFileCount++; //Even this line came from the same file, so
its just a repeated line.
				else  {
					h.leftFileCount--;
					similarityCounter++;  //This line came from the right file, so
increase similarityCounter and decrease the leftCounter.
				}
			}
			else  { //Line in hash came from right file
				if (leftFileIsLonger)  {
					h.rightFileCount--;
					similarityCounter++;
				}
				else
					h.rightFileCount++;
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
					hash.remove(line);
		}
	}
    return ((similarityCounter*100) / size); //Formula for calculating
similarity is same as git uses
}

--
Debayan Banerjee
Software Engineer

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Rename similarity scoring for file pair
  2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
@ 2010-04-12 18:14 ` Debayan Banerjee
  2010-04-17 11:03 ` Jeff King
  1 sibling, 0 replies; 3+ messages in thread
From: Debayan Banerjee @ 2010-04-12 18:14 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 242 bytes --]

On 12 April 2010 23:42, Debayan Banerjee <debayanin@gmail.com> wrote:
> I was going through this thread

Sorry about that. Sent it prematurely, and by mistake. Find the
attached file for the algorithm.

-- 
Debayan Banerjee
Software Engineer

[-- Attachment #2: similarity_algorithm.txt --]
[-- Type: text/plain, Size: 3642 bytes --]

def similarity(leftFileContent, rightFileContent)
{
	int lineCounter = 0;
	struct HashElement  {
		HashElement(): leftFileCount(0), rightFileCount(0) // One of these counters will remain 0 for its lifetime. The other will be >=0
		int leftFileCount; // Increment if line is from leftFile
		int rightFileCount; // Increment if line is from right file.
	};
	Hash<lineContent, HashElement> hash; //One hash for both files  
	int similartyCounter = 0;
	while(true)  {
		leftLline = leftFileContent[lineCounter];
		rightLine = rightFileContent[lineCounter];
		if (hash.contains(leftLine))  {
			HashElement h = hash.value(leftLine);
			if (h.leftFileCount > 0)   {//The previous occurence has come from the left file
				h.leftFileCount++;  //This line is another occurence of the same line in the same file. So we increment the counter.
			}
			else  {
				h.leftFileCount--;
				similarityCounter++;  //Hey, this line matched an earlier occurence from the other file! 
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
				hash.remove(leftLine);			
		}
		else  { // this line is being encountered for the first time, so add it to the hash
			HashElement h;
			h.leftFileCount++;
			hash.add(leftLine, h);
		}
		
		if (hash.contains(rightLine))  {
			HashElement h = hash.value(rightLine);
			if (h.rightFileCount > 0)   {//The previous occurence has come from the right file
				h.rightFileCount++;  //This line is another occurence of the same line in the same file. So we increment the counter.
			}
			else  {
				h.rightFileCount--;
				similarityCounter++;  //Hey, this line matched an earlier occurence from the other file! 
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
				hash.remove(line2);			
		}
		else  { // this line is being encountered for the first time, so add it to the hash
			HashElement h;
			h.rightFileCount++;
			hash.add(line2, h);			
		}
		lineCounter++;
		if (lineCounter == leftFileContent.size() || lineCounter == rightFileContent.size())
			break;
	}
	/* Upto this point we have processed an equal number of lines in both the files simultaneously. Now we may have a file which is
	larger in size than the other. We need to process the remanant of this file*/
	bool leftFileIsLonger;
	FileContent fileContent;
	int size;
	if (leftFileContent.size() == rightFileContent.size())		
		return ((similarityCounter*100) / rightFileContent.size());
	if (leftFileContent.size() >rightFileContent.size())  {
		leftFileIsLonger = true;
	        fileContent = leftFileContent; //fileContent should really be  a pointer to leftFileContent
		size = leftFileContent.size();
        }
	else  {
		leftFileIsLonger = false;
		fileContent = rightFileContent; //fileContent should really be  a pointer to rightFileContent
		size = rightFileContent.size();
	}
	for (i = lineCounter + 1;  i < size; i++ )  {
		line = fileContent[i];
		if (hash.contains(line))  {
			Hash h = hash.value(line);
			if (h.leftLineCount > 0)  { //Line in hash came from left file
				if (leftFileIsLonger)
					h.leftFileCount++; //Even this line came from the same file, so its just a repeated line.
				else  {
					h.leftFileCount--;
					similarityCounter++;  //This line came from the right file, so increase similarityCounter and decrease the leftCounter.
				}
			}
			else  { //Line in hash came from right file
				if (leftFileIsLonger)  {
					h.rightFileCount--;
					similarityCounter++;
				}
				else  
					h.rightFileCount++;
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
					hash.remove(line);
		}
	}
    return ((similarityCounter*100) / size); //Formula for calculating similarity is same as git uses
}

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Rename similarity scoring for file pair
  2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
  2010-04-12 18:14 ` Debayan Banerjee
@ 2010-04-17 11:03 ` Jeff King
  1 sibling, 0 replies; 3+ messages in thread
From: Jeff King @ 2010-04-17 11:03 UTC (permalink / raw)
  To: Debayan Banerjee; +Cc: git

On Mon, Apr 12, 2010 at 11:42:59PM +0530, Debayan Banerjee wrote:

> I was going through this thread
> <http://markmail.org/message/mge2spy7uqle5ghy#query:diffcore-rename.c%20algorithm+page:1+mid:ji7jkzypzqpml2xn+state:results>
> trying to understand how the similarity scoring for modified files
> works. I see that the algorithm uses two hashes, one for each file,
> and proceeds to compare the 2 hashes to determine percentage
> similarity.
> I think I have an algorithm which uses only one hash and has a worst
> case space complexity of O(N) where N is the number of lines
> inserted/deleted/moved.

That looks pretty similar to the algorithm that Andy Chup proposed. I
did some work towards integrating it, but never completed it. See these
threads for details:

  http://thread.gmane.org/gmane.comp.version-control.git/61975

  http://thread.gmane.org/gmane.comp.version-control.git/62655

So yes, it's an interesting way to go. But it needs somebody to:

  1. Implement it in C and integrate it into diffcore-rename (I did
     that, but there were some problems, see below).

  2. Confirm that the quality of results is similar to the current
     algorithm. In my case, the results were different, and I suggested
     some tweaks to improve them (see the second thread above). I don't
     remember if I ever followed up.

  3. Confirm that we really do get a speed improvement on the slow
     cases, and don't slow down or significantly bloat memory usage for
     the fast cases.

If you're interested in working on it, I'd be happy to help with
discussion and code review.

-Peff

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2010-04-17 11:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
2010-04-12 18:14 ` Debayan Banerjee
2010-04-17 11:03 ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).