Re: Rename similarity scoring for file pair

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Debayan Banerjee <debayanin@gmail.com>
To: git@vger.kernel.org
Subject: Re: Rename similarity scoring for file pair
Date: Mon, 12 Apr 2010 23:44:57 +0530	[thread overview]
Message-ID: <m2sa2cc2dbc1004121114j5ac8e00amc4eb426e99e2a9cc@mail.gmail.com> (raw)
In-Reply-To: <t2ua2cc2dbc1004121112h64f5b61ela57f5fa83abf51f@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 242 bytes --]

On 12 April 2010 23:42, Debayan Banerjee <debayanin@gmail.com> wrote:
> I was going through this thread

Sorry about that. Sent it prematurely, and by mistake. Find the
attached file for the algorithm.

-- 
Debayan Banerjee
Software Engineer

[-- Attachment #2: similarity_algorithm.txt --]
[-- Type: text/plain, Size: 3642 bytes --]

def similarity(leftFileContent, rightFileContent)
{
	int lineCounter = 0;
	struct HashElement  {
		HashElement(): leftFileCount(0), rightFileCount(0) // One of these counters will remain 0 for its lifetime. The other will be >=0
		int leftFileCount; // Increment if line is from leftFile
		int rightFileCount; // Increment if line is from right file.
	};
	Hash<lineContent, HashElement> hash; //One hash for both files  
	int similartyCounter = 0;
	while(true)  {
		leftLline = leftFileContent[lineCounter];
		rightLine = rightFileContent[lineCounter];
		if (hash.contains(leftLine))  {
			HashElement h = hash.value(leftLine);
			if (h.leftFileCount > 0)   {//The previous occurence has come from the left file
				h.leftFileCount++;  //This line is another occurence of the same line in the same file. So we increment the counter.
			}
			else  {
				h.leftFileCount--;
				similarityCounter++;  //Hey, this line matched an earlier occurence from the other file! 
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
				hash.remove(leftLine);			
		}
		else  { // this line is being encountered for the first time, so add it to the hash
			HashElement h;
			h.leftFileCount++;
			hash.add(leftLine, h);
		}
		
		if (hash.contains(rightLine))  {
			HashElement h = hash.value(rightLine);
			if (h.rightFileCount > 0)   {//The previous occurence has come from the right file
				h.rightFileCount++;  //This line is another occurence of the same line in the same file. So we increment the counter.
			}
			else  {
				h.rightFileCount--;
				similarityCounter++;  //Hey, this line matched an earlier occurence from the other file! 
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
				hash.remove(line2);			
		}
		else  { // this line is being encountered for the first time, so add it to the hash
			HashElement h;
			h.rightFileCount++;
			hash.add(line2, h);			
		}
		lineCounter++;
		if (lineCounter == leftFileContent.size() || lineCounter == rightFileContent.size())
			break;
	}
	/* Upto this point we have processed an equal number of lines in both the files simultaneously. Now we may have a file which is
	larger in size than the other. We need to process the remanant of this file*/
	bool leftFileIsLonger;
	FileContent fileContent;
	int size;
	if (leftFileContent.size() == rightFileContent.size())		
		return ((similarityCounter*100) / rightFileContent.size());
	if (leftFileContent.size() >rightFileContent.size())  {
		leftFileIsLonger = true;
	        fileContent = leftFileContent; //fileContent should really be  a pointer to leftFileContent
		size = leftFileContent.size();
        }
	else  {
		leftFileIsLonger = false;
		fileContent = rightFileContent; //fileContent should really be  a pointer to rightFileContent
		size = rightFileContent.size();
	}
	for (i = lineCounter + 1;  i < size; i++ )  {
		line = fileContent[i];
		if (hash.contains(line))  {
			Hash h = hash.value(line);
			if (h.leftLineCount > 0)  { //Line in hash came from left file
				if (leftFileIsLonger)
					h.leftFileCount++; //Even this line came from the same file, so its just a repeated line.
				else  {
					h.leftFileCount--;
					similarityCounter++;  //This line came from the right file, so increase similarityCounter and decrease the leftCounter.
				}
			}
			else  { //Line in hash came from right file
				if (leftFileIsLonger)  {
					h.rightFileCount--;
					similarityCounter++;
				}
				else  
					h.rightFileCount++;
			}
			if (h.leftFileCount == 0 && h.rightFileCount == 0)
					hash.remove(line);
		}
	}
    return ((similarityCounter*100) / size); //Formula for calculating similarity is same as git uses
}

next prev parent reply	other threads:[~2010-04-12 18:15 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-12 18:12 Rename similarity scoring for file pair Debayan Banerjee
2010-04-12 18:14 ` Debayan Banerjee [this message]
2010-04-17 11:03 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m2sa2cc2dbc1004121114j5ac8e00amc4eb426e99e2a9cc@mail.gmail.com \
    --to=debayanin@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).