* [PATCH 1/4] diffcore_count_changes: pass diffcore_filespec
2007-06-29 6:29 ` Junio C Hamano
@ 2007-06-29 6:35 ` Junio C Hamano
2007-06-29 6:35 ` [PATCH 2/4] diffcore_filespec: add is_binary Junio C Hamano
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-06-29 6:35 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano
We may want to use richer information on the data we are dealing
with in this function, so instead of passing a buffer address
and length, just pass the diffcore_filespec structure. Existing
callers always call this function with parameters taken from a
filespec anyway, so there is no functionality changes.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
diffcore-break.c | 3 +--
diffcore-delta.c | 8 ++++----
diffcore-rename.c | 3 +--
diffcore.h | 4 ++--
4 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/diffcore-break.c b/diffcore-break.c
index 9c19b8c..ae8a7d0 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -66,8 +66,7 @@ static int should_break(struct diff_filespec *src,
if (base_size < MINIMUM_BREAK_SIZE)
return 0; /* we do not break too small filepair */
- if (diffcore_count_changes(src->data, src->size,
- dst->data, dst->size,
+ if (diffcore_count_changes(src, dst,
NULL, NULL,
0,
&src_copied, &literal_added))
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 7338a40..0e1fae7 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -156,8 +156,8 @@ static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz)
return hash;
}
-int diffcore_count_changes(void *src, unsigned long src_size,
- void *dst, unsigned long dst_size,
+int diffcore_count_changes(struct diff_filespec *src,
+ struct diff_filespec *dst,
void **src_count_p,
void **dst_count_p,
unsigned long delta_limit,
@@ -172,14 +172,14 @@ int diffcore_count_changes(void *src, unsigned long src_size,
if (src_count_p)
src_count = *src_count_p;
if (!src_count) {
- src_count = hash_chars(src, src_size);
+ src_count = hash_chars(src->data, src->size);
if (src_count_p)
*src_count_p = src_count;
}
if (dst_count_p)
dst_count = *dst_count_p;
if (!dst_count) {
- dst_count = hash_chars(dst, dst_size);
+ dst_count = hash_chars(dst->data, dst->size);
if (dst_count_p)
*dst_count_p = dst_count;
}
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 79c984c..cb22736 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -189,8 +189,7 @@ static int estimate_similarity(struct diff_filespec *src,
delta_limit = (unsigned long)
(base_size * (MAX_SCORE-minimum_score) / MAX_SCORE);
- if (diffcore_count_changes(src->data, src->size,
- dst->data, dst->size,
+ if (diffcore_count_changes(src, dst,
&src->cnt_data, &dst->cnt_data,
delta_limit,
&src_copied, &literal_added))
diff --git a/diffcore.h b/diffcore.h
index 7b9294e..990dec5 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -103,8 +103,8 @@ void diff_debug_queue(const char *, struct diff_queue_struct *);
#define diff_debug_queue(a,b) do {} while(0)
#endif
-extern int diffcore_count_changes(void *src, unsigned long src_size,
- void *dst, unsigned long dst_size,
+extern int diffcore_count_changes(struct diff_filespec *src,
+ struct diff_filespec *dst,
void **src_count_p,
void **dst_count_p,
unsigned long delta_limit,
--
1.5.2.2.1414.g1e7d9
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 2/4] diffcore_filespec: add is_binary
2007-06-29 6:29 ` Junio C Hamano
2007-06-29 6:35 ` [PATCH 1/4] diffcore_count_changes: pass diffcore_filespec Junio C Hamano
@ 2007-06-29 6:35 ` Junio C Hamano
2007-06-29 6:36 ` [PATCH 3/4] diffcore-delta.c: update the comment on the algorithm Junio C Hamano
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-06-29 6:35 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano
diffcore-break and diffcore-rename would want to behave slightly
differently depending on the binary-ness of the data, so add one
bit to the filespec, as the structure is now passed down to
diffcore_count_changes() function.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
diff.c | 16 ++++++++++++++++
diffcore.h | 1 +
2 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/diff.c b/diff.c
index 9938969..74c1198 100644
--- a/diff.c
+++ b/diff.c
@@ -3005,6 +3005,22 @@ void diffcore_std(struct diff_options *options)
{
if (options->quiet)
return;
+
+ /*
+ * break/rename count similarity differently depending on
+ * the binary-ness.
+ */
+ if ((options->break_opt != -1) || (options->detect_rename)) {
+ struct diff_queue_struct *q = &diff_queued_diff;
+ int i;
+
+ for (i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ p->one->is_binary = file_is_binary(p->one);
+ p->two->is_binary = file_is_binary(p->two);
+ }
+ }
+
if (options->break_opt != -1)
diffcore_break(options->break_opt);
if (options->detect_rename)
diff --git a/diffcore.h b/diffcore.h
index 990dec5..0c8abb5 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -37,6 +37,7 @@ struct diff_filespec {
#define DIFF_FILE_VALID(spec) (((spec)->mode) != 0)
unsigned should_free : 1; /* data should be free()'ed */
unsigned should_munmap : 1; /* data should be munmap()'ed */
+ unsigned is_binary : 1; /* data should be considered "binary" */
};
extern struct diff_filespec *alloc_filespec(const char *);
--
1.5.2.2.1414.g1e7d9
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 3/4] diffcore-delta.c: update the comment on the algorithm.
2007-06-29 6:29 ` Junio C Hamano
2007-06-29 6:35 ` [PATCH 1/4] diffcore_count_changes: pass diffcore_filespec Junio C Hamano
2007-06-29 6:35 ` [PATCH 2/4] diffcore_filespec: add is_binary Junio C Hamano
@ 2007-06-29 6:36 ` Junio C Hamano
2007-06-29 6:36 ` [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files Junio C Hamano
2007-06-30 4:18 ` Applying patches in a directory that isn't a repository Geoff Russell
4 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-06-29 6:36 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano
The comment at the top of the file described an old algorithm
that was neutral to text/binary differences (it hashed sliding
window of N-byte sequences and counted overlaps), but long time
ago we switched to a new heuristics that are more suitable for
line oriented (read: text) files that are much faster.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
diffcore-delta.c | 21 +++++++++------------
1 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 0e1fae7..7116df0 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -5,23 +5,20 @@
/*
* Idea here is very simple.
*
- * We have total of (sz-N+1) N-byte overlapping sequences in buf whose
- * size is sz. If the same N-byte sequence appears in both source and
- * destination, we say the byte that starts that sequence is shared
- * between them (i.e. copied from source to destination).
+ * Almost all data we are interested in are text, but sometimes we have
+ * to deal with binary data. So we cut them into chunks delimited by
+ * LF byte, or 64-byte sequence, whichever comes first, and hash them.
*
- * For each possible N-byte sequence, if the source buffer has more
- * instances of it than the destination buffer, that means the
- * difference are the number of bytes not copied from source to
- * destination. If the counts are the same, everything was copied
- * from source to destination. If the destination has more,
- * everything was copied, and destination added more.
+ * For those chunks, if the source buffer has more instances of it
+ * than the destination buffer, that means the difference are the
+ * number of bytes not copied from source to destination. If the
+ * counts are the same, everything was copied from source to
+ * destination. If the destination has more, everything was copied,
+ * and destination added more.
*
* We are doing an approximation so we do not really have to waste
* memory by actually storing the sequence. We just hash them into
* somewhere around 2^16 hashbuckets and count the occurrences.
- *
- * The length of the sequence is arbitrarily set to 8 for now.
*/
/* Wild guess at the initial hash size */
--
1.5.2.2.1414.g1e7d9
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files
2007-06-29 6:29 ` Junio C Hamano
` (2 preceding siblings ...)
2007-06-29 6:36 ` [PATCH 3/4] diffcore-delta.c: update the comment on the algorithm Junio C Hamano
@ 2007-06-29 6:36 ` Junio C Hamano
2007-06-29 8:14 ` しらいしななこ
2007-06-30 4:18 ` Applying patches in a directory that isn't a repository Geoff Russell
4 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2007-06-29 6:36 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano
This ignores CR byte in CRLF sequence in text file when
computing similarity of two blobs.
Usually this should not matter as nobody sane would be checking
in a file with CRLF line endings to the repository (they would
use autocrlf so that the repository copy would have LF line
endings).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
diffcore-delta.c | 14 +++++++++++---
t/t0022-crlf-rename.sh | 33 +++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 3 deletions(-)
create mode 100755 t/t0022-crlf-rename.sh
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 7116df0..a038b16 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -122,11 +122,14 @@ static struct spanhash_top *add_spanhash(struct spanhash_top *top,
}
}
-static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz)
+static struct spanhash_top *hash_chars(struct diff_filespec *one)
{
int i, n;
unsigned int accum1, accum2, hashval;
struct spanhash_top *hash;
+ unsigned char *buf = one->data;
+ unsigned int sz = one->size;
+ int is_text = !one->is_binary;
i = INITIAL_HASH_SIZE;
hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i));
@@ -140,6 +143,11 @@ static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz)
unsigned int c = *buf++;
unsigned int old_1 = accum1;
sz--;
+
+ /* Ignore CR in CRLF sequence if text */
+ if (is_text && c == '\r' && sz && *buf == '\n')
+ continue;
+
accum1 = (accum1 << 7) ^ (accum2 >> 25);
accum2 = (accum2 << 7) ^ (old_1 >> 25);
accum1 += c;
@@ -169,14 +177,14 @@ int diffcore_count_changes(struct diff_filespec *src,
if (src_count_p)
src_count = *src_count_p;
if (!src_count) {
- src_count = hash_chars(src->data, src->size);
+ src_count = hash_chars(src);
if (src_count_p)
*src_count_p = src_count;
}
if (dst_count_p)
dst_count = *dst_count_p;
if (!dst_count) {
- dst_count = hash_chars(dst->data, dst->size);
+ dst_count = hash_chars(dst);
if (dst_count_p)
*dst_count_p = dst_count;
}
diff --git a/t/t0022-crlf-rename.sh b/t/t0022-crlf-rename.sh
new file mode 100755
index 0000000..430a1d1
--- /dev/null
+++ b/t/t0022-crlf-rename.sh
@@ -0,0 +1,33 @@
+#!/bin/sh
+
+test_description='ignore CR in CRLF sequence while computing similiarity'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+
+ cat ../t0022-crlf-rename.sh >sample &&
+ git add sample &&
+
+ test_tick &&
+ git commit -m Initial &&
+
+ sed -e "s/\$/
/" ../t0022-crlf-rename.sh >elpmas &&
+ git add elpmas &&
+ rm -f sample &&
+
+ test_tick &&
+ git commit -a -m Second
+
+'
+
+test_expect_success 'diff -M' '
+
+ git diff-tree -M -r --name-status HEAD^ HEAD |
+ sed -e "s/R[0-9]*/RNUM/" >actual &&
+ echo "RNUM sample elpmas" >expect &&
+ diff -u expect actual
+
+'
+
+test_done
--
1.5.2.2.1414.g1e7d9
^ permalink raw reply related [flat|nested] 9+ messages in thread* Re: [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files
2007-06-29 6:36 ` [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files Junio C Hamano
@ 2007-06-29 8:14 ` しらいしななこ
2007-06-29 8:51 ` Junio C Hamano
0 siblings, 1 reply; 9+ messages in thread
From: しらいしななこ @ 2007-06-29 8:14 UTC (permalink / raw)
To: Junio C Hamano; +Cc: GIT
Quoting Junio C Hamano <gitster@pobox.com>:
> This ignores CR byte in CRLF sequence in text file when
> computing similarity of two blobs.
> ...
> +test_expect_success 'diff -M' '
> +
> + git diff-tree -M -r --name-status HEAD^ HEAD |
> + sed -e "s/R[0-9]*/RNUM/" >actual &&
> + echo "RNUM sample elpmas" >expect &&
> + diff -u expect actual
> +
> +'
I tried this test but it does not give R100. The new file is unchanged except for
LF -> CRLF. Could you explain why?
--
Nanako Shiraishi
http://ivory.ap.teacup.com/nanako3/
----------------------------------------------------------------------
Get a free email address with REAL anti-spam protection.
http://www.bluebottle.com
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files
2007-06-29 8:14 ` しらいしななこ
@ 2007-06-29 8:51 ` Junio C Hamano
0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-06-29 8:51 UTC (permalink / raw)
To: しらいしななこ; +Cc: GIT
しらいしななこ <nanako3@bluebottle.com> writes:
> Quoting Junio C Hamano <gitster@pobox.com>:
>
>> This ignores CR byte in CRLF sequence in text file when
>> computing similarity of two blobs.
>> ...
>> +test_expect_success 'diff -M' '
>> +
>> + git diff-tree -M -r --name-status HEAD^ HEAD |
>> + sed -e "s/R[0-9]*/RNUM/" >actual &&
>> + echo "RNUM sample elpmas" >expect &&
>> + diff -u expect actual
>> +
>> +'
>
> I tried this test but it does not give R100. The new file is unchanged except for
> LF -> CRLF. Could you explain why?
Heh, I hate when people nitpick me ;-)
The similarity code counts "bytes copied from the source
material to the destination" and "bytes added to the source to
create the destination".
Rename detection uses only the former value. The amount of data
copied from the source is divided by the larger of the size of
source or destination.
In the case of our test script, the source material is about 560
bytes, while the destination material is about 590 bytes, after
adding CR to the end of every line. We find that 560 bytes have
been copied from the source material, and floor(560 * 100 / 590)
is 94%, which is what you would see as the result.
We could adjust max_size variable inside diffcore_count_changes,
but I am not sure if it is worth the trouble.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Applying patches in a directory that isn't a repository
2007-06-29 6:29 ` Junio C Hamano
` (3 preceding siblings ...)
2007-06-29 6:36 ` [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files Junio C Hamano
@ 2007-06-30 4:18 ` Geoff Russell
4 siblings, 0 replies; 9+ messages in thread
From: Geoff Russell @ 2007-06-30 4:18 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On 6/29/07, Junio C Hamano <gitster@pobox.com> wrote:
> "Geoff Russell" <geoffrey.russell@gmail.com> writes:
>
> > But "git am" needs (AFAIK) a full repository. Is there a way to apply
> > a patch without
> > .git being present?
>
> If the recipient does not have a git repository, there is no
> point using "git am", as it is about making commits out of
> e-mails.
>
> git-apply acts as a plain "patch applicator".
For some reason I thought git-apply also needed a repository --- but it doesn't
and I've just tested it and, bingo, it is perfect for my needs. Many thanks.
Cheers,
Geoff
^ permalink raw reply [flat|nested] 9+ messages in thread