From: Junio C Hamano <gitster@pobox.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>
Subject: [PATCH 3/4] diffcore-delta.c: update the comment on the algorithm.
Date: Thu, 28 Jun 2007 23:36:00 -0700
Message-ID: <11830989623983-git-send-email-gitster@pobox.com>
In-Reply-To: <7v3b0bi88r.fsf@assigned-by-dhcp.pobox.com>
The comment at the top of the file described an old algorithm
that was neutral to text/binary differences (it hashed a sliding
window of N-byte sequences and counted overlaps), but a long time
ago we switched to a new heuristic that is much faster and better
suited to line-oriented (read: text) files.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
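Not part of the commit message: for readers unfamiliar with the
heuristic the updated comment describes, here is a rough,
self-contained C sketch of it. It is not the code in
diffcore-delta.c. The names sketch_count_chunks() and
sketch_count_changes(), the djb2-style hash and the fixed bucket
array are all assumptions made for illustration, and hash
collisions are simply tolerated as part of the approximation.

```c
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS  (1u << 16)  /* "somewhere around 2^16 hashbuckets" */
#define MAX_CHUNK 64          /* cut at LF or after 64 bytes, whichever is first */

/*
 * Split buf into chunks and add each chunk's byte length to the
 * bucket selected by its hash.  The hash is a djb2-style stand-in
 * (an assumption), not the one git uses.
 */
static void sketch_count_chunks(const unsigned char *buf, size_t len,
				unsigned long *bucket)
{
	size_t i = 0;

	while (i < len) {
		unsigned long hash = 5381;
		size_t n = 0;

		while (i < len && n < MAX_CHUNK) {
			unsigned char c = buf[i++];

			hash = hash * 33 + c;
			n++;
			if (c == '\n')
				break;
		}
		bucket[hash % NBUCKETS] += n;
	}
}

/*
 * Estimate how many bytes of src survive in dst (*copied) and how
 * many bytes dst added on top (*added).  Returns 0 on success and
 * -1 if memory allocation fails.
 */
static int sketch_count_changes(const void *src, size_t src_len,
				const void *dst, size_t dst_len,
				unsigned long *copied, unsigned long *added)
{
	unsigned long *s = calloc(NBUCKETS, sizeof(*s));
	unsigned long *d = calloc(NBUCKETS, sizeof(*d));
	size_t i;

	if (!s || !d) {
		free(s);
		free(d);
		return -1;
	}
	sketch_count_chunks(src, src_len, s);
	sketch_count_chunks(dst, dst_len, d);

	*copied = 0;
	*added = 0;
	for (i = 0; i < NBUCKETS; i++) {
		if (s[i] < d[i]) {
			/* destination has more: all copied, the rest is new */
			*copied += s[i];
			*added += d[i] - s[i];
		} else {
			/* source has more or the same: only d[i] bytes were copied */
			*copied += d[i];
		}
	}
	free(s);
	free(d);
	return 0;
}
```

With the source "a\nb\nc\n" and the destination "a\nc\nd\n", this
sketch counts 4 bytes as copied (the "a" and "c" lines) and 2 bytes
as added (the "d" line).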
diffcore-delta.c | 21 +++++++++------------
1 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 0e1fae7..7116df0 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -5,23 +5,20 @@
/*
* Idea here is very simple.
*
- * We have total of (sz-N+1) N-byte overlapping sequences in buf whose
- * size is sz. If the same N-byte sequence appears in both source and
- * destination, we say the byte that starts that sequence is shared
- * between them (i.e. copied from source to destination).
+ * Almost all data we are interested in are text, but sometimes we have
+ * to deal with binary data. So we cut them into chunks delimited by
+ * LF byte, or 64-byte sequence, whichever comes first, and hash them.
*
- * For each possible N-byte sequence, if the source buffer has more
- * instances of it than the destination buffer, that means the
- * difference are the number of bytes not copied from source to
- * destination. If the counts are the same, everything was copied
- * from source to destination. If the destination has more,
- * everything was copied, and destination added more.
+ * For those chunks, if the source buffer has more instances of it
+ * than the destination buffer, that means the difference are the
+ * number of bytes not copied from source to destination. If the
+ * counts are the same, everything was copied from source to
+ * destination. If the destination has more, everything was copied,
+ * and destination added more.
*
* We are doing an approximation so we do not really have to waste
* memory by actually storing the sequence. We just hash them into
* somewhere around 2^16 hashbuckets and count the occurrences.
- *
- * The length of the sequence is arbitrarily set to 8 for now.
*/
/* Wild guess at the initial hash size */
--
1.5.2.2.1414.g1e7d9