git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@osdl.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: Re: Fix up diffcore-rename scoring
Date: Mon, 13 Mar 2006 07:38:53 -0800 (PST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org> (raw)
In-Reply-To: <7vzmjupqv0.fsf@assigned-by-dhcp.cox.net>



On Mon, 13 Mar 2006, Junio C Hamano wrote:
> 
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.

Yes. However, there are better ways to do that in practice.

The most effective way that is generally used is to not use a fixed 
chunk-size, but use a terminating character, together with a 
minimum/maximum chunksize.

There's a pretty natural terminating character that works well for 
sources: '\n'.

So the natural way to do similarity detection when most of the code is 
line-based is to do the hashing on chunks that follow the rule "minimum of 
<n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".

So if you don't see any '\n' at all (or the only such one is less than <n> 
bytes into your current window), do the hash over a <2n>-byte chunk (this 
takes care of binaries and/or long lines).

This - for source code - allows you to ignore trivial byte offset things, 
because you have a character that is used for synchronization. So you 
don't need to do hashing at every byte in both files - you end up doing 
the hashing only at line boundaries in practice. And it still _works_ for 
binary files, although you effectively need bigger identical chunk-sizes 
to find similarities (for text-files, it finds similarities of size <n>, 
for binaries the similarities need to effectively be of size 3*n, because 
you chunk it up at ~2*n, and only generate the hash at certain offsets in 
the source binary).

		Linus

  reply	other threads:[~2006-03-13 15:39 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
2006-03-13  6:44 ` Linus Torvalds
2006-03-13  6:46 ` Junio C Hamano
2006-03-13  7:09   ` Linus Torvalds
2006-03-13  7:42     ` Junio C Hamano
2006-03-13  7:44     ` Linus Torvalds
2006-03-13 10:43       ` Junio C Hamano
2006-03-13 15:38         ` Linus Torvalds [this message]
2006-03-14  0:49           ` Rutger Nijlunsing
2006-03-14  0:55           ` Junio C Hamano
2006-04-06 21:01       ` Geert Bosch
2006-04-11 22:04         ` Junio C Hamano
2006-04-14 17:46           ` Geert Bosch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0603130727350.3618@g5.osdl.org \
    --to=torvalds@osdl.org \
    --cc=git@vger.kernel.org \
    --cc=junkio@cox.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).