git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Teemu Likonen <tlikonen@iki.fi>
Cc: Junio C Hamano <gitster@pobox.com>,
	Ittay Dror <ittayd@tikalk.com>,
	git@vger.kernel.org
Subject: Re: detecting rename->commit->modify->commit
Date: Thu, 1 May 2008 19:09:25 -0400
Message-ID: <20080501230925.GC21731@sigill.intra.peff.net>
In-Reply-To: <20080501203940.GA3524@mithlond.arda.local>

[cc'd Junio for comments on this rename optimization]

On Thu, May 01, 2008 at 11:39:40PM +0300, Teemu Likonen wrote:

> > Hmm, looking at the code, though, 50% is supposed to be the default
> > minimum. So there might actually be a bug.
> 
> I did some testing... A file, containing 10 lines (about 200 bytes),
> renamed and then modified (similarity index being a bit over 50%). Git

Ah, OK. The problem comes because the toy example is so tiny. It hits
this code chunk:

  if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
          return 0;

where base_size is the size of the smaller file in bytes, and delta_size
is the difference between the size of the two files. This is an
optimization so that we don't even have to look at the contents.

But it is basing the percentage off of the smaller file, so even though
file B ("hello\nworld\n") is 50% made up of file A ("hello\n"), we
actually end up saying "there must be at least as much content added to
make B as there is in A already". IOW, the "percentage similarity" is
based off of the smaller file for this optimization.
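
To make the arithmetic concrete, here is a minimal standalone sketch of that
size-only check. The function name rejected_by_size() is hypothetical, and the
values of MAX_SCORE (60000.0) and the 50% default (30000) are assumptions about
git's diffcore.h, not something stated in this message.

```c
#include <stdio.h>

/* Assumed to match git's diffcore.h; not taken from this thread. */
#define MAX_SCORE 60000.0
#define DEFAULT_MINIMUM_SCORE 30000

/*
 * Hypothetical helper: returns 1 when the size-only shortcut would
 * discard the pair without looking at contents, 0 otherwise.
 * base_size is the smaller of the two sizes, delta_size the difference.
 */
int rejected_by_size(unsigned long size_a, unsigned long size_b,
                     int minimum_score)
{
    unsigned long base_size = size_a < size_b ? size_a : size_b;
    unsigned long max_size = size_a < size_b ? size_b : size_a;
    unsigned long delta_size = max_size - base_size;

    /* Same comparison as the quoted chunk. */
    return base_size * (MAX_SCORE - minimum_score) < delta_size * MAX_SCORE;
}
```

With "hello\n" (6 bytes) and "hello\nworld\n" (12 bytes) at the assumed 50%
default, the left side is 6 * 30000 and the right side is 6 * 60000, so the
pair is rejected. Under the same assumptions, a pair is rejected as soon as
the size difference exceeds half the size of the smaller file.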

Obviously this is a toy case, but I wonder if there are other larger
cases where you end up with a file which has substantial copied content,
but also _grows_ a lot (not just changes). For example, consider the
file:

  1
  2
  3
  4
  5
  6
  7
  8
  9

that is, nine lines each with a number. Now rename it, and start adding
more numbers. We detect the addition of 10, 11, 12. But adding 13 means
we no longer match. So even with only 4 lines added, we fail to match.
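
The cutoff in this example can be reproduced with a small sketch that counts
bytes only. It assumes each line is the decimal number followed by a newline,
that MAX_SCORE is 60000.0, and that the minimum score is 30000 (50%); the
helper names line_bytes() and first_rejected_number() are hypothetical.

```c
#include <stdio.h>

/* Assumed to match git's diffcore.h; not taken from this thread. */
#define MAX_SCORE 60000.0

/* Bytes used by one line holding the decimal number n plus '\n'. */
unsigned long line_bytes(int n)
{
    char buf[32];
    return (unsigned long)snprintf(buf, sizeof(buf), "%d\n", n);
}

/*
 * The original file holds the numbers 1..last_original, one per line.
 * Append last_original+1, last_original+2, ... up to limit and return
 * the first appended number at which the size-only check rejects the
 * pair, or 0 if it never does within the limit.
 */
int first_rejected_number(int last_original, int limit, int minimum_score)
{
    unsigned long base_size = 0, delta_size = 0;
    int n;

    for (n = 1; n <= last_original; n++)
        base_size += line_bytes(n);

    for (n = last_original + 1; n <= limit; n++) {
        delta_size += line_bytes(n);
        if (base_size * (MAX_SCORE - minimum_score) < delta_size * MAX_SCORE)
            return n;
    }
    return 0;
}
```

For the file above, base_size is 18 bytes. Adding 10, 11 and 12 contributes
9 bytes, which is exactly half of 18 and still passes; adding 13 brings the
difference to 12 bytes, so first_rejected_number(9, 100, 30000) returns 13.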

But again, this is a bit of a toy case. It relies on the line length
being a significant factor compared to number of lines.

-Peff
