From: Linus Torvalds <torvalds@osdl.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: Re: Fix up diffcore-rename scoring
Date: Mon, 13 Mar 2006 07:38:53 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org>
In-Reply-To: <7vzmjupqv0.fsf@assigned-by-dhcp.cox.net>
On Mon, 13 Mar 2006, Junio C Hamano wrote:
>
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.
Yes. However, there are better ways to do that in practice.
The most effective way that is generally used is to not use a fixed
chunk-size, but use a terminating character, together with a
minimum/maximum chunksize.
There's a pretty natural terminating character that works well for
sources: '\n'.
So the natural way to do similarity detection when most of the code is
line-based is to do the hashing on chunks that follow the rule "minimum of
<n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".
So if you don't see any '\n' at all (or the only one is less than <n>
bytes into your current window), do the hash over a full <2*n>-byte chunk
(this takes care of binaries and/or long lines).
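
A minimal sketch of that chunking rule, in Python for brevity. It is not the diffcore-delta code; the name split_chunks and the default n = 32 are assumptions made for illustration. It ends a chunk just after the first '\n' that gives a length of at least n bytes, and otherwise cuts at 2*n bytes. The final chunk of the input may be shorter than n.

```python
def split_chunks(data: bytes, n: int = 32) -> list:
    """Split data into chunks of at least n and at most 2*n bytes.

    A chunk ends just after the first '\n' that makes it at least n bytes
    long; if no such newline exists within 2*n bytes, the chunk is cut at
    2*n bytes. Only the last chunk may be shorter than n.
    """
    chunks = []
    pos = 0
    while pos < len(data):
        limit = min(pos + 2 * n, len(data))
        # A newline at index pos + n - 1 or later yields a chunk of >= n bytes.
        nl = data.find(b"\n", pos + n - 1, limit)
        end = nl + 1 if nl != -1 else limit
        chunks.append(data[pos:end])
        pos = end
    return chunks
```

With lines whose length falls between n and 2*n bytes, every chunk is exactly one line; input without any newline degrades to fixed 2*n-byte chunks.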
This - for source code - allows you to ignore trivial byte offset things,
because you have a character that is used for synchronization. So you
don't need to do hashing at every byte in both files - you end up doing
the hashing only at line boundaries in practice. And it still _works_ for
binary files, although you effectively need bigger identical chunk-sizes
to find similarities (for text-files, it finds similarities of size <n>,
for binaries the similarities need to effectively be of size 3*n, because
you chunk it up at ~2*n, and only generate the hash at certain offsets in
the source binary).
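
The synchronization effect can be illustrated with a small, self-contained sketch. Everything here is an assumption for illustration: the helper names, the use of SHA-1 as the chunk hash, the set-overlap score, and the fixed-offset baseline it is compared against. None of it is taken from git. The boundary rule is repeated inline so that the example runs on its own.

```python
import hashlib


def chunk_hashes(data: bytes, n: int, use_newlines: bool) -> set:
    """Hash each chunk of data and return the set of digests.

    With use_newlines=True, chunks follow the "min n, max 2*n, end at a
    newline" rule; otherwise they are fixed 2*n-byte chunks.
    """
    hashes = set()
    pos = 0
    while pos < len(data):
        limit = min(pos + 2 * n, len(data))
        nl = data.find(b"\n", pos + n - 1, limit) if use_newlines else -1
        end = nl + 1 if nl != -1 else limit
        hashes.add(hashlib.sha1(data[pos:end]).digest())
        pos = end
    return hashes


def similarity(a: bytes, b: bytes, n: int = 16, use_newlines: bool = True) -> float:
    """Fraction of chunk hashes shared by a and b (a hypothetical score)."""
    ha = chunk_hashes(a, n, use_newlines)
    hb = chunk_hashes(b, n, use_newlines)
    return len(ha & hb) / max(len(ha), len(hb), 1)


# A text file, and the same file with one line inserted at the top,
# which shifts every later byte offset by the length of that line.
old = b"".join(b"line %03d of the sample file\n" % i for i in range(40))
new = b"/* new header */\n" + old

line_score = similarity(old, new, use_newlines=True)
fixed_score = similarity(old, new, use_newlines=False)
print(f"newline-terminated chunks: {line_score:.2f}")
print(f"fixed-offset chunks:       {fixed_score:.2f}")
```

In this example the newline-terminated chunks resynchronize right after the inserted line, so nearly all hashes match, while the fixed-offset chunks never line up again. The lines were deliberately chosen to be between n and 2*n bytes long; with shorter or longer lines the boundaries may take a few chunks to resynchronize.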
Linus