* [PATCH] Detect renames in diff family. @ 2005-05-19 10:32 Junio C Hamano 2005-05-19 16:19 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 10:32 UTC (permalink / raw) To: torvalds; +Cc: git This patch rips out the rename detection engine from diff-helper and moves it to the diff core, and updates the internal calling convention used by diff-tree family into the diff core. In order to give the same option name to diff-tree family as well as to diff-helper, I've changed the earlier diff-helper '-r' option to '-M' (stands for Move; sorry but the natural abbreviation 'r' for 'rename' is already taken for 'recursive'). Although I did a fair amount of test with the git-diff-tree with existing rename commits in the core GIT repository, this should still be considered beta (preview) release. This patch depends on the diff-delta patch I stole from Nico. Please give it a good beating. Special request for Linus is to check if I did not screw up the various calls into the diff core from diff-tree. Essentially the idea is to start one patchset session with diff_setup() and close it with diff_flush() before you start another patchset session. This implements almost everything I wanted to see in this series of patch, except a few minor cleanups in the calling convention into diff core, but that will be a separate cleanup patch. Signed-off-by: Junio C Hamano <junkio@cox.net> --- Documentation/git-diff-cache.txt | 5 Documentation/git-diff-files.txt | 5 Documentation/git-diff-helper.txt | 4 Documentation/git-diff-tree.txt | 5 diff-cache.c | 16 + diff-files.c | 11 - diff-helper.c | 209 ++----------------- diff-tree.c | 25 +- diff.c | 410 +++++++++++++++++++++++++++++++++++--- diff.h | 19 - git-apply-patch-script | 3 t/t4001-diff-rename.sh | 60 +++++ 12 files changed, 533 insertions(+), 239 deletions(-) t/t4001-diff-rename.sh (. --> 100644) diff --git a/Documentation/git-diff-cache.txt b/Documentation/git-diff-cache.txt --- a/Documentation/git-diff-cache.txt +++ b/Documentation/git-diff-cache.txt @@ -9,7 +9,7 @@ SYNOPSIS -------- -'git-diff-cache' [-p] [-r] [-z] [-m] [--cached] <tree-ish> +'git-diff-cache' [-p] [-r] [-z] [-m] [-M] [--cached] <tree-ish> DESCRIPTION ----------- @@ -33,6 +33,9 @@ -z:: \0 line termination on output +-M:: + Detect renames; implies -p. + --cached:: do not consider the on-disk file at all diff --git a/Documentation/git-diff-files.txt b/Documentation/git-diff-files.txt --- a/Documentation/git-diff-files.txt +++ b/Documentation/git-diff-files.txt @@ -9,7 +9,7 @@ SYNOPSIS -------- -'git-diff-files' [-p] [-q] [-r] [-z] [<pattern>...] +'git-diff-files' [-p] [-q] [-r] [-z] [-M] [<pattern>...] DESCRIPTION ----------- @@ -26,6 +26,9 @@ -q:: Remain silent even on nonexisting files +-M:: + Detect renames; implies -p. + -r:: This flag does not mean anything. It is there only to match git-diff-tree. Unlike git-diff-tree, git-diff-files always looks diff --git a/Documentation/git-diff-helper.txt b/Documentation/git-diff-helper.txt --- a/Documentation/git-diff-helper.txt +++ b/Documentation/git-diff-helper.txt @@ -9,7 +9,7 @@ SYNOPSIS -------- -'git-diff-helper' [-z] [-R] [-r] +'git-diff-helper' [-z] [-R] [-M] DESCRIPTION ----------- @@ -31,7 +31,7 @@ would show a diff to bring the working file back to what is in the <tree>. --r:: +-M:: Detect renames. diff --git a/Documentation/git-diff-tree.txt b/Documentation/git-diff-tree.txt --- a/Documentation/git-diff-tree.txt +++ b/Documentation/git-diff-tree.txt @@ -9,7 +9,7 @@ SYNOPSIS -------- -'git-diff-tree' [-p] [-r] [-z] [--stdin] [-m] [-s] [-v] <tree-ish> <tree-ish> [<pattern>]\* +'git-diff-tree' [-p] [-r] [-z] [--stdin] [-M] [-m] [-s] [-v] <tree-ish> <tree-ish> [<pattern>]\* DESCRIPTION ----------- @@ -33,6 +33,9 @@ generate patch (see section on generating patches). For git-diff-tree, this flag implies '-r' as well. +-M:: + Detect renames; implies -p, in turn implying also '-r'. + -r:: recurse diff --git a/diff-cache.c b/diff-cache.c --- a/diff-cache.c +++ b/diff-cache.c @@ -5,6 +5,7 @@ static int generate_patch = 0; static int match_nonexisting = 0; static int line_termination = '\n'; +static int detect_rename = 0; /* A file entry went away or appeared */ static void show_file(const char *prefix, struct cache_entry *ce, unsigned char *sha1, unsigned int mode) @@ -165,13 +166,14 @@ } static char *diff_cache_usage = -"git-diff-cache [-p] [-r] [-z] [-m] [--cached] <tree sha1>"; +"git-diff-cache [-p] [-r] [-z] [-m] [-M] [--cached] <tree-ish>"; int main(int argc, char **argv) { unsigned char tree_sha1[20]; void *tree; unsigned long size; + int ret; read_cache(); while (argc > 2) { @@ -186,6 +188,10 @@ generate_patch = 1; continue; } + if (!strcmp(arg, "-M")) { + generate_patch = detect_rename = 1; + continue; + } if (!strcmp(arg, "-z")) { line_termination = '\0'; continue; @@ -204,6 +210,9 @@ if (argc != 2 || get_sha1(argv[1], tree_sha1)) usage(diff_cache_usage); + if (generate_patch) + diff_setup(detect_rename, 0, 0, 0, 0); + mark_merge_entries(); tree = read_object_with_reference(tree_sha1, "tree", &size, 0); @@ -212,5 +221,8 @@ if (read_tree(tree, size, 1)) die("unable to read tree object %s", argv[1]); - return diff_cache(active_cache, active_nr); + ret = diff_cache(active_cache, active_nr); + if (generate_patch) + diff_flush(); + return ret; } diff --git a/diff-files.c b/diff-files.c --- a/diff-files.c +++ b/diff-files.c @@ -7,10 +7,11 @@ #include "diff.h" static const char *diff_files_usage = -"diff-files [-p] [-q] [-r] [-z] [paths...]"; +"diff-files [-p] [-q] [-r] [-z] [-M] [paths...]"; static int generate_patch = 0; static int line_termination = '\n'; +static int detect_rename = 0; static int silent = 0; static int matches_pathspec(struct cache_entry *ce, char **spec, int cnt) @@ -79,6 +80,9 @@ ; /* no-op */ else if (!strcmp(argv[1], "-z")) line_termination = 0; + else if (!strcmp(argv[1], "-M")) { + detect_rename = generate_patch = 1; + } else usage(diff_files_usage); argv++; argc--; @@ -92,6 +96,9 @@ exit(1); } + if (generate_patch) + diff_setup(detect_rename, 0, 0, 0, 0); + for (i = 0; i < entries; i++) { struct stat st; unsigned int oldmode, mode; @@ -132,5 +139,7 @@ show_modified(oldmode, mode, ce->sha1, null_sha1, ce->name); } + if (generate_patch) + diff_flush(); return 0; } diff --git a/diff-helper.c b/diff-helper.c --- a/diff-helper.c +++ b/diff-helper.c @@ -6,160 +6,23 @@ #include "strbuf.h" #include "diff.h" -static int matches_pathspec(const char *name, const char **spec, int cnt) -{ - int i; - int namelen = strlen(name); - for (i = 0; i < cnt; i++) { - int speclen = strlen(spec[i]); - if (! strncmp(spec[i], name, speclen) && - speclen <= namelen && - (name[speclen] == 0 || - name[speclen] == '/')) - return 1; - } - return 0; -} - static int detect_rename = 0; -/* - * We do not detect circular renames. Just hold created and deleted - * entries and later attempt to match them up. If they do not match, - * then spit them out as deletes or creates as original. - */ - -static struct diff_spec_hold { - struct diff_spec_hold *next; - struct diff_spec_hold *matched; - struct diff_spec old, new; - char path[1]; -} *createdfile, *deletedfile; - -static void hold_spec(const char *path, - struct diff_spec *old, struct diff_spec *new) -{ - struct diff_spec_hold **list, *elem; - list = (! old->file_valid) ? &createdfile : &deletedfile; - elem = xmalloc(sizeof(*elem) + strlen(path)); - strcpy(elem->path, path); - elem->next = *list; - *list = elem; - elem->old = *old; - elem->new = *new; - elem->matched = 0; -} - -#define MINIMUM_SCORE 7000 -int estimate_similarity(struct diff_spec *one, struct diff_spec *two) -{ - /* Return how similar they are, representing the score as an - * integer between 0 and 10000. - * - * This version is very dumb and detects exact matches only. - * Wnen Nico's delta stuff gets in, I'll use the delta - * algorithm to estimate the similarity score in core. - */ - - if (one->sha1_valid && two->sha1_valid && - !memcmp(one->blob_sha1, two->blob_sha1, 20)) - return 10000; - return 0; -} - -static void flush_renames(const char **spec, int cnt, int reverse) -{ - struct diff_spec_hold *rename_src, *rename_dst, *elem; - struct diff_spec_hold *leftover = NULL; - int score, best_score; - - while (createdfile) { - rename_dst = createdfile; - createdfile = rename_dst->next; - best_score = MINIMUM_SCORE; - rename_src = NULL; - for (elem = deletedfile; - elem; - elem = elem->next) { - if (elem->matched) - continue; - score = estimate_similarity(&elem->old, - &rename_dst->new); - if (best_score < score) { - rename_src = elem; - best_score = score; - } - } - if (rename_src) { - rename_src->matched = rename_dst; - rename_dst->matched = rename_src; - - if (!cnt || - matches_pathspec(rename_src->path, spec, cnt) || - matches_pathspec(rename_dst->path, spec, cnt)) { - if (reverse) - run_external_diff(rename_dst->path, - rename_src->path, - &rename_dst->new, - &rename_src->old); - else - run_external_diff(rename_src->path, - rename_dst->path, - &rename_src->old, - &rename_dst->new); - } - } - else { - rename_dst->next = leftover; - leftover = rename_dst; - } - } - - /* unmatched deletes */ - for (elem = deletedfile; elem; elem = elem->next) { - if (elem->matched) - continue; - if (!cnt || - matches_pathspec(elem->path, spec, cnt)) { - if (reverse) - run_external_diff(elem->path, NULL, - &elem->new, &elem->old); - else - run_external_diff(elem->path, NULL, - &elem->old, &elem->new); - } - } - - /* unmatched creates */ - for (elem = leftover; elem; elem = elem->next) { - if (!cnt || - matches_pathspec(elem->path, spec, cnt)) { - if (reverse) - run_external_diff(elem->path, NULL, - &elem->new, &elem->old); - else - run_external_diff(elem->path, NULL, - &elem->old, &elem->new); - } - } -} - -static int parse_oneside_change(const char *cp, struct diff_spec *one, - char *path) +static int parse_oneside_change(const char *cp, int *mode, + unsigned char *sha1, char *path) { - int ch; + int ch, m; - one->file_valid = one->sha1_valid = 1; - one->mode = 0; + m = 0; while ((ch = *cp) && '0' <= ch && ch <= '7') { - one->mode = (one->mode << 3) | (ch - '0'); + m = (m << 3) | (ch - '0'); cp++; } - + *mode = m; if (strncmp(cp, "\tblob\t", 6)) return -1; cp += 6; - if (get_sha1_hex(cp, one->blob_sha1)) + if (get_sha1_hex(cp, sha1)) return -1; cp += 40; if (*cp++ != '\t') @@ -168,79 +31,63 @@ return 0; } -static int parse_diff_raw_output(const char *buf, - const char **spec, int cnt, int reverse) +static int parse_diff_raw_output(const char *buf) { - struct diff_spec old, new; char path[PATH_MAX]; + unsigned char old_sha1[20], new_sha1[20]; const char *cp = buf; - int ch; + int ch, old_mode, new_mode; switch (*cp++) { case 'U': - if (!cnt || matches_pathspec(cp + 1, spec, cnt)) - diff_unmerge(cp + 1); - return 0; + diff_unmerge(cp + 1); + break; case '+': - old.file_valid = 0; - parse_oneside_change(cp, &new, path); + parse_oneside_change(cp, &new_mode, new_sha1, path); + diff_addremove('+', new_mode, new_sha1, path, NULL); break; case '-': - new.file_valid = 0; - parse_oneside_change(cp, &old, path); + parse_oneside_change(cp, &old_mode, old_sha1, path); + diff_addremove('-', old_mode, old_sha1, path, NULL); break; case '*': - old.file_valid = old.sha1_valid = - new.file_valid = new.sha1_valid = 1; - old.mode = new.mode = 0; + old_mode = new_mode = 0; while ((ch = *cp) && ('0' <= ch && ch <= '7')) { - old.mode = (old.mode << 3) | (ch - '0'); + old_mode = (old_mode << 3) | (ch - '0'); cp++; } if (strncmp(cp, "->", 2)) return -1; cp += 2; while ((ch = *cp) && ('0' <= ch && ch <= '7')) { - new.mode = (new.mode << 3) | (ch - '0'); + new_mode = (new_mode << 3) | (ch - '0'); cp++; } if (strncmp(cp, "\tblob\t", 6)) return -1; cp += 6; - if (get_sha1_hex(cp, old.blob_sha1)) + if (get_sha1_hex(cp, old_sha1)) return -1; cp += 40; if (strncmp(cp, "->", 2)) return -1; cp += 2; - if (get_sha1_hex(cp, new.blob_sha1)) + if (get_sha1_hex(cp, new_sha1)) return -1; cp += 40; if (*cp++ != '\t') return -1; strcpy(path, cp); + diff_change(old_mode, new_mode, old_sha1, new_sha1, path, 0); break; default: return -1; } - - if (detect_rename && old.file_valid != new.file_valid) { - /* hold these */ - hold_spec(path, &old, &new); - return 0; - } - - if (!cnt || matches_pathspec(path, spec, cnt)) { - if (reverse) - run_external_diff(path, NULL, &new, &old); - else - run_external_diff(path, NULL, &old, &new); - } return 0; } static const char *diff_helper_usage = - "git-diff-helper [-r] [-R] [-z] paths..."; + "git-diff-helper [-z] [-R] [-M] paths..."; int main(int ac, const char **av) { struct strbuf sb; @@ -254,7 +101,7 @@ reverse = 1; else if (av[1][1] == 'z') line_termination = 0; - else if (av[1][1] == 'r') + else if (av[1][1] == 'M') detect_rename = 1; else usage(diff_helper_usage); @@ -262,18 +109,20 @@ } /* the remaining parameters are paths patterns */ + diff_setup(detect_rename, 0, reverse, av+1, ac-1); + while (1) { int status; read_line(&sb, stdin, line_termination); if (sb.eof) break; - status = parse_diff_raw_output(sb.buf, av+1, ac-1, reverse); + status = parse_diff_raw_output(sb.buf); if (status) { - flush_renames(av+1, ac-1, reverse); + diff_flush(); printf("%s%c", sb.buf, line_termination); } } - flush_renames(av+1, ac-1, reverse); + diff_flush(); return 0; } diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -9,6 +9,7 @@ static int read_stdin = 0; static int line_termination = '\n'; static int generate_patch = 0; +static int detect_rename = 0; static const char *header = NULL; static const char *header_prefix = ""; @@ -281,6 +282,18 @@ return retval; } +static int diff_tree_sha1_top(const unsigned char *old, + const unsigned char *new, const char *base) +{ + int ret; + if (generate_patch) + diff_setup(detect_rename, 0, 0, 0, 0); + ret = diff_tree_sha1(old, new, base); + if (generate_patch) + diff_flush(); + return ret; +} + static int get_one_line(const char *msg, unsigned long len) { int ret = 0; @@ -377,7 +390,7 @@ if (get_sha1_hex(buf + offset + 7, parent)) return -1; header = generate_header(name, sha1_to_hex(parent), buf, size); - diff_tree_sha1(parent, commit, ""); + diff_tree_sha1_top(parent, commit, ""); if (!header && verbose_header) header_prefix = "\ndiff-tree "; offset += 48; @@ -401,14 +414,14 @@ line[81] = 0; sprintf(this_header, "%s (from %s)\n", line, line+41); header = this_header; - return diff_tree_sha1(parent, commit, ""); + return diff_tree_sha1_top(parent, commit, ""); } line[40] = 0; return diff_tree_commit(commit, line); } static char *diff_tree_usage = -"diff-tree [-p] [-r] [-z] [--stdin] [-m] [-s] [-v] <tree sha1> <tree sha1>"; +"diff-tree [-p] [-r] [-z] [--stdin] [-M] [-m] [-s] [-v] <tree-ish> <tree-ish>"; int main(int argc, char **argv) { @@ -447,6 +460,10 @@ recursive = generate_patch = 1; continue; } + if (!strcmp(arg, "-M")) { + detect_rename = recursive = generate_patch = 1; + continue; + } if (!strcmp(arg, "-z")) { line_termination = '\0'; continue; @@ -490,7 +507,7 @@ diff_tree_commit(sha1[0], NULL); break; case 2: - diff_tree_sha1(sha1[0], sha1[1], ""); + diff_tree_sha1_top(sha1[0], sha1[1], ""); break; } diff --git a/diff.c b/diff.c --- a/diff.c +++ b/diff.c @@ -7,8 +7,10 @@ #include <limits.h> #include "cache.h" #include "diff.h" +#include "delta.h" static const char *diff_opts = "-pu"; +static unsigned char null_sha1[20] = { 0, }; static const char *external_diff(void) { @@ -79,6 +81,18 @@ char tmp_path[50]; } diff_temp[2]; +struct diff_spec { + unsigned char blob_sha1[20]; + unsigned short mode; /* file mode */ + unsigned sha1_valid : 1; /* if true, use blob_sha1 and trust mode; + * however with a NULL SHA1, read them + * from the file system. + * if false, use the name and read mode from + * the filesystem. + */ + unsigned file_valid : 1; /* if false the file does not even exist */ +}; + static void builtin_diff(const char *name_a, const char *name_b, struct diff_tempfile *temp) @@ -160,7 +174,7 @@ struct cache_entry *ce; struct stat st; int pos, len; - + /* We do not read the cache ourselves here, because the * benchmark with my previous version that always reads cache * shows that it makes things worse for diff-tree comparing @@ -214,9 +228,6 @@ struct diff_tempfile *temp, struct diff_spec *one) { - static unsigned char null_sha1[20] = { 0, }; - int use_work_tree = 0; - if (!one->file_valid) { not_a_valid_file: /* A '-' entry produces this for file-2, and @@ -228,12 +239,8 @@ return; } - if (one->sha1_valid && - (!memcmp(one->blob_sha1, null_sha1, sizeof(null_sha1)) || - work_tree_matches(name, one->blob_sha1))) - use_work_tree = 1; - - if (!one->sha1_valid || use_work_tree) { + if (!one->sha1_valid || + work_tree_matches(name, one->blob_sha1)) { struct stat st; temp->name = name; if (lstat(temp->name, &st) < 0) { @@ -295,22 +302,59 @@ remove_tempfile(); } +static int detect_rename; +static int reverse_diff; +static const char **pathspec; +static int speccnt; +static int diff_rename_minimum_score; + +static int matches_pathspec(const char *name) +{ + int i; + int namelen; + + if (speccnt == 0) + return 1; + + namelen = strlen(name); + for (i = 0; i < speccnt; i++) { + int speclen = strlen(pathspec[i]); + if (! strncmp(pathspec[i], name, speclen) && + speclen <= namelen && + (name[speclen] == 0 || name[speclen] == '/')) + return 1; + } + return 0; +} + /* An external diff command takes: * * diff-cmd name infile1 infile1-sha1 infile1-mode \ - * infile2 infile2-sha1 infile2-mode. + * infile2 infile2-sha1 infile2-mode [ rename-to ] * */ -void run_external_diff(const char *name, - const char *other, - struct diff_spec *one, - struct diff_spec *two) +static void run_external_diff(const char *name, + const char *other, + struct diff_spec *one, + struct diff_spec *two) { struct diff_tempfile *temp = diff_temp; pid_t pid; int status; static int atexit_asked = 0; + if (reverse_diff) { + struct diff_spec *tmp_spec; + tmp_spec = one; one = two; two = tmp_spec; + if (other) { + const char *tmp; + tmp = name; name = other; other = tmp; + } + } + + if (!matches_pathspec(name) && (!other || !matches_pathspec(other))) + return; + if (one && two) { prepare_temp_file(name, &temp[0], one); prepare_temp_file(other ? : name, &temp[1], two); @@ -329,14 +373,23 @@ die("unable to fork"); if (!pid) { const char *pgm = external_diff(); - /* not passing rename patch to external ones */ - if (!other && pgm) { - if (one && two) - execlp(pgm, pgm, - name, - temp[0].name, temp[0].hex, temp[0].mode, - temp[1].name, temp[1].hex, temp[1].mode, - NULL); + if (pgm) { + if (one && two) { + const char *exec_arg[9]; + const char **arg = &exec_arg[0]; + *arg++ = pgm; + *arg++ = name; + *arg++ = temp[0].name; + *arg++ = temp[0].hex; + *arg++ = temp[0].mode; + *arg++ = temp[1].name; + *arg++ = temp[1].hex; + *arg++ = temp[1].mode; + if (other) + *arg++ = other; + *arg = 0; + execvp(pgm, (char *const*) exec_arg); + } else execlp(pgm, pgm, name, NULL); } @@ -367,6 +420,293 @@ remove_tempfile(); } +/* + * We do not detect circular renames. Just hold created and deleted + * entries and later attempt to match them up. If they do not match, + * then spit them out as deletes or creates as original. + */ + +static struct diff_spec_hold { + struct diff_spec_hold *next; + struct diff_spec it; + unsigned long size; + int flags; +#define MATCHED 1 +#define SHOULD_FREE 2 +#define SHOULD_MUNMAP 4 + void *data; + char path[1]; +} *createdfile, *deletedfile; + +static void hold_diff(const char *name, + struct diff_spec *one, + struct diff_spec *two) +{ + struct diff_spec_hold **list, *elem; + + if (one->file_valid && two->file_valid) + die("internal error"); + + if (!detect_rename) { + run_external_diff(name, NULL, one, two); + return; + } + elem = xmalloc(sizeof(*elem) + strlen(name)); + strcpy(elem->path, name); + elem->size = 0; + elem->data = NULL; + elem->flags = 0; + if (one->file_valid) { + list = &deletedfile; + elem->it = *one; + } + else { + list = &createdfile; + elem->it = *two; + } + elem->next = *list; + *list = elem; +} + +static int populate_data(struct diff_spec_hold *s) +{ + char type[20]; + + if (s->data) + return 0; + if (s->it.sha1_valid) { + s->data = read_sha1_file(s->it.blob_sha1, type, &s->size); + s->flags |= SHOULD_FREE; + } + else { + struct stat st; + int fd; + fd = open(s->path, O_RDONLY); + if (fd < 0) + return -1; + if (fstat(fd, &st)) { + close(fd); + return -1; + } + s->size = st.st_size; + s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0); + close(fd); + if (!s->size) + s->data = ""; + else + s->flags |= SHOULD_MUNMAP; + } + return 0; +} + +static void free_data(struct diff_spec_hold *s) +{ + if (s->flags & SHOULD_FREE) + free(s->data); + else if (s->flags & SHOULD_MUNMAP) + munmap(s->data, s->size); + s->flags &= ~(SHOULD_FREE|SHOULD_MUNMAP); +} + +static void flush_remaining_diff(struct diff_spec_hold *elem, + int on_created_list) +{ + static struct diff_spec null_file_spec; + + null_file_spec.file_valid = 0; + for ( ; elem ; elem = elem->next) { + free_data(elem); + if (elem->flags & MATCHED) + continue; + if (on_created_list) + run_external_diff(elem->path, NULL, + &null_file_spec, &elem->it); + else + run_external_diff(elem->path, NULL, + &elem->it, &null_file_spec); + } +} + +static int is_exact_match(struct diff_spec_hold *src, + struct diff_spec_hold *dst) +{ + if (src->it.sha1_valid && dst->it.sha1_valid && + !memcmp(src->it.blob_sha1, dst->it.blob_sha1, 20)) + return 1; + if (populate_data(src) || populate_data(dst)) + /* this is an error but will be caught downstream */ + return 0; + if (src->size == dst->size && + !memcmp(src->data, dst->data, src->size)) + return 1; + return 0; +} + +#define MINIMUM_SCORE 5000 +int estimate_similarity(struct diff_spec_hold *src, struct diff_spec_hold *dst) +{ + /* src points at a deleted file and dst points at a created + * file. They may be quite similar, in which case we want to + * say src is renamed to dst. + * + * Compare them and return how similar they are, representing + * the score as an integer between 0 and 10000. 10000 is + * reserved for the case where they match exactly. + */ + void *delta; + unsigned long delta_size; + + delta_size = ((src->size < dst->size) ? + (dst->size - src->size) : (src->size - dst->size)); + + /* We would not consider rename followed by more than + * 20% edits; that is, delta_size must be smaller than + * (src->size + dst->size)/2 * 0.2, which means... + */ + if ((src->size + dst->size) < delta_size * 10) + return 0; + + delta = diff_delta(src->data, src->size, + dst->data, dst->size, + &delta_size); + free(delta); + + /* This "delta" is really xdiff with adler32 and all the + * overheads but it is a quick and dirty approximation. + * + * Now we will give some score to it. Let's say 20% edit gets + * 5000 points and 0% edit gets 9000 points. That is, every + * 1/20000 edit gets 1 point penalty. The amount of penalty is: + * + * (delta_size * 2 / (src->size + dst->size)) * 20000 + * + */ + return 9000 - (40000 * delta_size / (src->size+dst->size)); +} + +struct diff_score { + struct diff_spec_hold *src; + struct diff_spec_hold *dst; + int score; +}; + +static int score_compare(const void *a_, const void *b_) +{ + const struct diff_score *a = a_, *b = b_; + return b->score - a->score; +} + +static void flush_rename_pair(struct diff_spec_hold *src, + struct diff_spec_hold *dst) +{ + src->flags |= MATCHED; + dst->flags |= MATCHED; + free_data(src); + free_data(dst); + run_external_diff(src->path, dst->path, + &src->it, &dst->it); +} + +static void free_held_diff(struct diff_spec_hold *list) +{ + struct diff_spec_hold *h; + for (h = list; list; list = h) { + h = list->next; + free_data(list); + free(list); + } +} + +void diff_flush(void) +{ + int num_create, num_delete, c, d; + struct diff_spec_hold *elem, *src, *dst; + struct diff_score *mx; + + /* We really want to cull the candidates list early + * with cheap tests in order to avoid doing deltas. + */ + for (dst = createdfile; dst; dst = dst->next) { + for (src = deletedfile; src; src = src->next) { + if (! is_exact_match(src, dst)) + continue; + flush_rename_pair(src, dst); + break; + } + } + + /* Count surviving candidates */ + for (num_create = 0, elem = createdfile; elem; elem = elem->next) + if (!(elem->flags & MATCHED)) + num_create++; + + for (num_delete = 0, elem = deletedfile; elem; elem = elem->next) + if (!(elem->flags & MATCHED)) + num_delete++; + + if (num_create == 0 || num_delete == 0) + goto exit_path; + + mx = xmalloc(sizeof(*mx) * num_create * num_delete); + for (c = 0, dst = createdfile; dst; dst = dst->next) { + int base = c * num_delete; + if (dst->flags & MATCHED) + continue; + for (d = 0, src = deletedfile; src; src = src->next) { + struct diff_score *m = &mx[base+d]; + if (src->flags & MATCHED) + continue; + m->src = src; + m->dst = dst; + m->score = estimate_similarity(src, dst); + d++; + } + c++; + } + qsort(mx, num_create*num_delete, sizeof(*mx), score_compare); + + for (c = 0; c < num_create * num_delete; c++) { + src = mx[c].src; + dst = mx[c].dst; + if ((src->flags & MATCHED) || (dst->flags & MATCHED)) + continue; + fprintf(stderr, + "**score ** %d %s %s\n", + mx[c].score, src->path, dst->path); + } + + for (c = 0; c < num_create * num_delete; c++) { + src = mx[c].src; + dst = mx[c].dst; + if ((src->flags & MATCHED) || (dst->flags & MATCHED)) + continue; + if (mx[c].score < diff_rename_minimum_score) + break; + flush_rename_pair(src, dst); + } + + exit_path: + flush_remaining_diff(createdfile, 1); + flush_remaining_diff(deletedfile, 0); + free_held_diff(createdfile); + free_held_diff(deletedfile); + createdfile = deletedfile = NULL; +} + +void diff_setup(int detect_rename_, int minimum_score_, int reverse_diff_, + const char **pathspec_, int speccnt_) +{ + free_held_diff(createdfile); + free_held_diff(deletedfile); + createdfile = deletedfile = NULL; + + detect_rename = detect_rename_; + reverse_diff = reverse_diff_; + pathspec = pathspec_; + speccnt = speccnt_; + diff_rename_minimum_score = minimum_score_ ? : MINIMUM_SCORE; +} + void diff_addremove(int addremove, unsigned mode, const unsigned char *sha1, const char *base, const char *path) @@ -376,7 +716,8 @@ memcpy(spec[0].blob_sha1, sha1, 20); spec[0].mode = mode; - spec[0].sha1_valid = spec[0].file_valid = 1; + spec[0].sha1_valid = !!memcmp(sha1, null_sha1, 20); + spec[0].file_valid = 1; spec[1].file_valid = 0; if (addremove == '+') { @@ -384,12 +725,12 @@ } else { one = spec; two = one + 1; } - + if (path) { strcpy(concatpath, base); strcat(concatpath, path); } - run_external_diff(path ? concatpath : base, NULL, one, two); + hold_diff(path ? concatpath : base, one, two); } void diff_change(unsigned old_mode, unsigned new_mode, @@ -399,17 +740,22 @@ char concatpath[PATH_MAX]; struct diff_spec spec[2]; + if (path) { + strcpy(concatpath, base); + strcat(concatpath, path); + } + memcpy(spec[0].blob_sha1, old_sha1, 20); spec[0].mode = old_mode; memcpy(spec[1].blob_sha1, new_sha1, 20); spec[1].mode = new_mode; - spec[0].sha1_valid = spec[0].file_valid = 1; - spec[1].sha1_valid = spec[1].file_valid = 1; + spec[0].sha1_valid = !!memcmp(old_sha1, null_sha1, 20); + spec[1].sha1_valid = !!memcmp(new_sha1, null_sha1, 20); + spec[1].file_valid = spec[0].file_valid = 1; - if (path) { - strcpy(concatpath, base); - strcat(concatpath, path); - } + /* We do not look at changed files as candidate for + * rename detection ever. + */ run_external_diff(path ? concatpath : base, NULL, &spec[0], &spec[1]); } diff --git a/diff.h b/diff.h --- a/diff.h +++ b/diff.h @@ -17,21 +17,10 @@ extern void diff_unmerge(const char *path); -/* These are for diff-helper */ +extern void diff_setup(int detect_rename, int minimum_score, + int reverse, + const char **spec, int cnt); -struct diff_spec { - unsigned char blob_sha1[20]; - unsigned short mode; /* file mode */ - unsigned sha1_valid : 1; /* if true, use blob_sha1 and trust mode; - * however with a NULL SHA1, read them - * from the file system. - * if false, use the name and read mode from - * the filesystem. - */ - unsigned file_valid : 1; /* if false the file does not even exist */ -}; - -extern void run_external_diff(const char *name, const char *other, - struct diff_spec *, struct diff_spec *); +extern void diff_flush(void); #endif /* DIFF_H */ diff --git a/git-apply-patch-script b/git-apply-patch-script --- a/git-apply-patch-script +++ b/git-apply-patch-script @@ -11,6 +11,9 @@ 1) echo >&2 "cannot handle unmerged diff on path $1." exit 1 ;; +8) + echo >&2 "cannot handle rename diff between $1 and $8 yet." + exit 1 ;; esac name="$1" tmp1="$2" hex1="$3" mode1="$4" tmp2="$5" hex2="$6" mode2="$7" diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh new file mode 100644 --- /dev/null +++ b/t/t4001-diff-rename.sh @@ -0,0 +1,60 @@ +#!/bin/sh +# +# Copyright (c) 2005 Junio C Hamano +# + +test_description='Test rename detection in diff engine. + +' +. ./test-lib.sh + +echo >path0 'Line 1 +Line 2 +Line 3 +Line 4 +Line 5 +Line 6 +Line 7 +Line 8 +Line 9 +Line 10 +line 11 +Line 12 +Line 13 +Line 14 +Line 15 +' + +test_expect_success \ + 'update-cache --add a file.' \ + 'git-update-cache --add path0' + +test_expect_success \ + 'write that tree.' \ + 'tree=$(git-write-tree)' + +sed -e 's/line/Line/' <path0 >path1 +rm -f path0 +test_expect_success \ + 'renamed and edited the file.' \ + 'git-update-cache --add --remove path0 path1' + +test_expect_success \ + 'git-diff-cache -p -M after rename and editing.' \ + 'git-diff-cache -p -M $tree >current' +cat >expected <<\EOF +diff --git a/path0 b/path1 +rename old path0 +rename new path1 +--- a/path0 ++++ b/path1 +@@ -8,7 +8,7 @@ Line 7 + Line 8 + Line 9 + Line 10 +-line 11 ++Line 11 + Line 12 + Line 13 + Line 14 +EOF ------------------------------------------------ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 10:32 [PATCH] Detect renames in diff family Junio C Hamano @ 2005-05-19 16:19 ` Linus Torvalds 2005-05-19 17:13 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-19 16:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 19 May 2005, Junio C Hamano wrote: > > Special request for Linus is to check if I did not screw up the > various calls into the diff core from diff-tree. Essentially > the idea is to start one patchset session with diff_setup() and > close it with diff_flush() before you start another patchset > session. It all looks ok from a quick setup, and with this I can now do git-whatchanged -M in the kernel, and searching for renames I find: diff --git a/arch/um/kernel/sys_call_table.c b/arch/um/sys-x86_64/sys_call_table.c rename old arch/um/kernel/sys_call_table.c rename new arch/um/sys-x86_64/sys_call_table.c --- a/arch/um/kernel/sys_call_table.c +++ b/arch/um/sys-x86_64/sys_call_table.c @@ -1,4 +1,4 @@ -/* +/* * Copyright (C) 2000 Jeff Dike (jdike@karaya.com) * Copyright 2003 PathScale, Inc. * Licensed under the GPL @@ -14,6 +14,12 @@ #include "sysdep/syscalls.h" #include "kern_util.h" +#ifdef CONFIG_NFSD .... which looks quite correct. I notice that you left some debugging output in there ("**score **" stuff), and I'll remove it, but it's merged and pushed out and passed my trivial tests. [ rambling mode on: ] One thing that struck me is that there is nothing wrong with having the same old file marked twice for a rename, or considering new files to be copies of old files. So if we ever allow that, then "rename" may be the wrong name for this, since the logic certainly allows the old file to still exist (or be removed and show up multiple times in a new guise). In other words, let's say that we create a new architecture or a new filesystem, and we have tons of _new_ files, but not a lot of removed files. It would literally be very cool to see that the new files are based on contents of old files, and that it would thus potentially be very interesting to see a diff like diff --git a/arch/i386/kernel/irq.c b/arch/x86-64/kernel/irq.c based-on old arch/i386/kernel/irq.c creates new arch/x86-64/kernel/irq.c --- arch/i386/kernel/irq.c +++ arch/x86_64/kernel/irq.c @@ -1,205 +1,31 @@ /* - * linux/arch/i386/kernel/irq.c + * linux/arch/x86_64/kernel/irq.c * * Copyright (C) 1992, 1998 Linus Torvalds, Ingo Molnar * ... (the above is a made-up example, but it's at least _half-way_ valid). I'm not suggesting you actually do this, if only because it's quite expensive: it means that any newly added file would have to be compared with _all_ files in the previous archive, which is just too damn expensive. But I'd like people to kind of keep this in mind as a possibility, because maybe wasting CPU time in a big way might actually be acceptable in some cases, and having a separate flag to enable this kind of thing might be interesting, no? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 16:19 ` Linus Torvalds @ 2005-05-19 17:13 ` Junio C Hamano 2005-05-19 17:31 ` Linus Torvalds ` (2 more replies) 2005-05-19 17:46 ` Joel Becker 2005-05-21 9:37 ` Junio C Hamano 2 siblings, 3 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 17:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> I notice that you left some debugging output in there ("**score **" LT> stuff), and I'll remove it, but it's merged and pushed out and passed my LT> trivial tests. Oops,... thanks. I still had some doubts about it and that's why I said it was beta, but that is fine. My doubts are minor: - the command line interface "-M" to read "-M" or "-M[0-9]" (one digit); -M defaults to -M5 and give the cut-off point at similarity score 5000, -M9 at 9000, etc. - I was debating myself if adding something like this was a good idea (using scale between 0 and 9 corresponding the -M[0-9] option): diff --git a/arch/um/kernel/sys_call_table.c b/arch/um/sys-x86_64/sys_call_table.c *** rename similarity index 8 rename old arch/um/kernel/sys_call_table.c rename new arch/um/sys-x86_64/sys_call_table.c --- a/arch/um/kernel/sys_call_table.c +++ b/arch/um/sys-x86_64/sys_call_table.c - I have been assuming that diff_delta uses its two input read-only but have not verified that myself yet. - I did not check for leaks and knew I had outdated comments in some while doing the diff core interface cleanups. A bit of clean-up patch, which may not apply exactly if you removed the **score** stuff is attached. Signed-off-by: Junio C Hamano <junkio@cox.net> --- # - HEAD: Detect renames in diff family. # + 11: Cleanup and leak fix after rename in diff family patch. diff --git a/diff.c b/diff.c --- a/diff.c +++ b/diff.c @@ -85,12 +85,10 @@ struct diff_spec { unsigned char blob_sha1[20]; unsigned short mode; /* file mode */ unsigned sha1_valid : 1; /* if true, use blob_sha1 and trust mode; - * however with a NULL SHA1, read them - * from the file system. - * if false, use the name and read mode from + * if false, use the name and read from * the filesystem. */ - unsigned file_valid : 1; /* if false the file does not even exist */ + unsigned file_valid : 1; /* if false the file does not exist */ }; static void builtin_diff(const char *name_a, @@ -506,6 +504,7 @@ static void free_data(struct diff_spec_h else if (s->flags & SHOULD_MUNMAP) munmap(s->data, s->size); s->flags &= ~(SHOULD_FREE|SHOULD_MUNMAP); + s->data = 0; } static void flush_remaining_diff(struct diff_spec_hold *elem, @@ -625,9 +624,17 @@ void diff_flush(void) /* We really want to cull the candidates list early * with cheap tests in order to avoid doing deltas. + * + * With the current callers, we should not have already + * matched entries at this point, but it is nonetheless + * checked for sanity. */ for (dst = createdfile; dst; dst = dst->next) { + if (dst->flags & MATCHED) + continue; for (src = deletedfile; src; src = src->next) { + if (src->flags & MATCHED) + continue; if (! is_exact_match(src, dst)) continue; flush_rename_pair(src, dst); @@ -665,6 +672,7 @@ void diff_flush(void) } qsort(mx, num_create*num_delete, sizeof(*mx), score_compare); +#if 0 for (c = 0; c < num_create * num_delete; c++) { src = mx[c].src; dst = mx[c].dst; @@ -674,6 +682,7 @@ void diff_flush(void) "**score ** %d %s %s\n", mx[c].score, src->path, dst->path); } +#endif for (c = 0; c < num_create * num_delete; c++) { src = mx[c].src; @@ -684,6 +693,7 @@ void diff_flush(void) break; flush_rename_pair(src, dst); } + free(mx); exit_path: flush_remaining_diff(createdfile, 1); ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 17:13 ` Junio C Hamano @ 2005-05-19 17:31 ` Linus Torvalds 2005-05-19 18:29 ` Nicolas Pitre 2005-05-19 18:42 ` Junio C Hamano 2 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-19 17:31 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 19 May 2005, Junio C Hamano wrote: > > - I have been assuming that diff_delta uses its two input > read-only but have not verified that myself yet. Since test-delta uses mmap(PROT_READ), we'd get SIGSEGV if diff_delta actually wrote to the thing. So this is a safe assumption. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 17:13 ` Junio C Hamano 2005-05-19 17:31 ` Linus Torvalds @ 2005-05-19 18:29 ` Nicolas Pitre 2005-05-19 18:46 ` Junio C Hamano 2005-05-19 18:47 ` Junio C Hamano 2005-05-19 18:42 ` Junio C Hamano 2 siblings, 2 replies; 43+ messages in thread From: Nicolas Pitre @ 2005-05-19 18:29 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git On Thu, 19 May 2005, Junio C Hamano wrote: > - the command line interface "-M" to read "-M" or "-M[0-9]" > (one digit); -M defaults to -M5 and give the cut-off point at > similarity score 5000, -M9 at 9000, etc. Why not a fractional value instead? -M1 is 100% the same while -M.95 allows for some 5% changes. This is clear and not based on some arbitrary level values. > - I have been assuming that diff_delta uses its two input > read-only but have not verified that myself yet. It does. Nicolas ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:29 ` Nicolas Pitre @ 2005-05-19 18:46 ` Junio C Hamano 2005-05-19 18:58 ` Nicolas Pitre 2005-05-19 18:47 ` Junio C Hamano 1 sibling, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 18:46 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, git >>>>> "NP" == Nicolas Pitre <nico@cam.org> writes: NP> On Thu, 19 May 2005, Junio C Hamano wrote: >> - the command line interface "-M" to read "-M" or "-M[0-9]" >> (one digit); -M defaults to -M5 and give the cut-off point at >> similarity score 5000, -M9 at 9000, etc. NP> Why not a fractional value instead? -M1 is 100% the same while -M.95 NP> allows for some 5% changes. We are essentially saying the same thing. Internally diff core uses score between 0 and 10000 but single digit proposed above or fractional both hides that from the user by normalizing the scale to something less arbitrary (in my case 0-9 in your case 0-1.0). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:46 ` Junio C Hamano @ 2005-05-19 18:58 ` Nicolas Pitre 2005-05-19 20:36 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Nicolas Pitre @ 2005-05-19 18:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git On Thu, 19 May 2005, Junio C Hamano wrote: > >>>>> "NP" == Nicolas Pitre <nico@cam.org> writes: > > NP> On Thu, 19 May 2005, Junio C Hamano wrote: > >> - the command line interface "-M" to read "-M" or "-M[0-9]" > >> (one digit); -M defaults to -M5 and give the cut-off point at > >> similarity score 5000, -M9 at 9000, etc. > > NP> Why not a fractional value instead? -M1 is 100% the same while -M.95 > NP> allows for some 5% changes. > > We are essentially saying the same thing. Internally diff core > uses score between 0 and 10000 but single digit proposed above > or fractional both hides that from the user by normalizing the > scale to something less arbitrary (in my case 0-9 in your case > 0-1.0). Yes, but 0-9 is putting a bound on the accuracy. What if someone wants not more than 2% difference? Nicolas ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:58 ` Nicolas Pitre @ 2005-05-19 20:36 ` Junio C Hamano 2005-05-19 20:48 ` Nicolas Pitre 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 20:36 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, git >>>>> "NP" == Nicolas Pitre <nico@cam.org> writes: NP> Yes, but 0-9 is putting a bound on the accuracy. What if someone wants NP> not more than 2% difference? That statement is correct, but I think you are looking at it from a developer perspective. I suspect people would not want to pay the price of having always to type many digits for the benefit of being able to specify differences of 2% and 5%. Would you also complain gzip only lets you say -1 .. -9 and not -1.63 ;-)? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 20:36 ` Junio C Hamano @ 2005-05-19 20:48 ` Nicolas Pitre 2005-05-19 21:44 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Nicolas Pitre @ 2005-05-19 20:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git On Thu, 19 May 2005, Junio C Hamano wrote: > >>>>> "NP" == Nicolas Pitre <nico@cam.org> writes: > > NP> Yes, but 0-9 is putting a bound on the accuracy. What if someone wants > NP> not more than 2% difference? > > That statement is correct, but I think you are looking at it > from a developer perspective. > > I suspect people would not want to pay the price of having > always to type many digits for the benefit of being able to > specify differences of 2% and 5%. Would you also complain gzip > only lets you say -1 .. -9 and not -1.63 ;-)? Are we talking about git plumbing or porcelain here? Nicolas ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 20:48 ` Nicolas Pitre @ 2005-05-19 21:44 ` Junio C Hamano 2005-05-19 22:26 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 21:44 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, git >>>>> "NP" == Nicolas Pitre <nico@cam.org> writes: NP> Are we talking about git plumbing or porcelain here? We are talking about the Plumbing. Thank you for reminding me, but sometimes people end up using the bare Plumbing. As I stated before, I do not do Porcelain [*1*]; my main interest lies in helping Linus and Linux Kernel development process, by helping him in the Plumbing area and making the use of bare Plumbing layer a confortable enough experience. My ultimate goal is to make the Plumbing useful enough to make what Porcelain layers do more or less irrelevant ;-). [Footnote] *1* Yes I do have my own Porcelain layer, and personally I feel some of the things it does and some of the approaches it takes are quite good, but I do not advocate it more than necessary on this list. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 21:44 ` Junio C Hamano @ 2005-05-19 22:26 ` Linus Torvalds 2005-05-19 23:32 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2005-05-19 22:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git On Thu, 19 May 2005, Junio C Hamano wrote: > > We are talking about the Plumbing. Thank you for reminding me, > but sometimes people end up using the bare Plumbing. There's a pretty simple and nice way to make it both useful and easy to do to arbitrary precision. Think of the number following the -M as the mantissa. So -M9 means 0.9 aka "90% match" (or "difference", depending on which way you want to go), and in general -Mx would have the "10% increments" thing. But since it's a fraction, you just give more precision by adding more numbers, and -M99 would be "99% match", while "-M02" would be "2% match" Then it would be logical for a plain -M to be 100% match / 0% difference (ie only show renames that are exact), since a "0% match" / 100% difference is nonsensical. Alternatively, we'd have -M (without any number) just default to something, and you'd give a separate number of how closely you want to mach things, ie # These all mean the same thing: (default) 20% difference git-diff-tree -M git-diff-tree -M --match=80 git-diff-tree -M --differ=20 # show only renames that are perfect matches. git-diff-tree -M --match=100 # show _everything_ as a rename, except the # matching matrix means that we prefer better # matches over worse git-diff-tree -M --match=0 Hmm? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 22:26 ` Linus Torvalds @ 2005-05-19 23:32 ` Junio C Hamano 2005-05-20 0:10 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 23:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, git Not quite a related topic, but unless we start to support the copy detection you recommended against, my understanding is that -M does not do anything useful for git-diff-files. Is there a good use case for it? I am not advocating for its removal so please leave the command line option there. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 23:32 ` Junio C Hamano @ 2005-05-20 0:10 ` Linus Torvalds 2005-05-20 0:48 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2005-05-20 0:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git On Thu, 19 May 2005, Junio C Hamano wrote: > > Not quite a related topic, but unless we start to support the > copy detection you recommended against, my understanding is that > -M does not do anything useful for git-diff-files. Is there a > good use case for it? I am not advocating for its removal so > please leave the command line option there. Hmm.. You're right, but it feels kind of wrong. Clearly we support removals in diff-files, but you're right, we can never have something show up as an addition, since we only ever compare against files that are already mentioned in the cache. On the other hand, a unmerged entry really _could_ be a new file, could it not? Even if we right now always report it as "Unmerged path"? If it's a case of existing in stage 2/3 only, it really should show up as a new file diff for git-diff-files, no? Or am I just being confused and/or stupid? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-20 0:10 ` Linus Torvalds @ 2005-05-20 0:48 ` Junio C Hamano 0 siblings, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-20 0:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> Hmm.. You're right, but it feels kind of wrong. Clearly we LT> support removals in diff-files, but you're right, we can LT> never have something show up as an addition, since we only LT> ever compare against files that are already mentioned in the LT> cache. LT> Or am I just being confused and/or stupid? No, you are not confused nor stupid. We discussed something related to this about three weeks ago. We ended up not doing what we discussed, but if we had adopted the convention of the delayed addition to the cache I described as "magic SHA1" in the quoted message below, that would have been the "intent to add" that diff-files would report as an addition. To: Linus Torvalds <torvalds@osdl.org> Cc: git@vger.kernel.org Subject: Re: [PATCH] Really fix git-merge-one-file-script this time. From: Junio C Hamano <junkio@cox.net> Date: Sun, 01 May 2005 11:38:15 -0700 Message-ID: <7vzmveu6zs.fsf@assigned-by-dhcp.cox.net> (... some discussion omitted ...) I am wondering if the following changes would make sense and make things easier for you: * git-merge-one-file-script is changed to register the path with --cacheinfo using magic SHA1 0{40} instead of using the resulting file on the filesystem. Do keep the current behaviour of leaving the merge results of trivial merges (both kind) in the work tree. * git-write-tree is changed to refuse to write from a cache that records the magic SHA1. * git-ls-files acquires a new option --merged to notice the magic SHA1 and shows the paths that have such SHA1. * git-update-cache acquires a new option --resolve to notice the magic SHA1 and: - if the named path is not in the work tree anymore, delete the entry. - if the named path exists in the work tree, compute the latest SHA1 for that file and update the entry. Changes other than the first two listed above are purely optional, since the Porcelain layer can implement them without the Plumbing support. Not doing them would keep the Plumbing somewhat cleaner by not having to know about this magic SHA1 convention. On the other hand, we already use that convention in git-diff-cache, so it may even be a consistent change to make these optional changes. Essentially, the magic SHA1 in the cache means "I know the user wants me to keep an eye on this path when it matters" [*2*]. [Footnotes] (... footnote *1* omitted ...) *2* This convention would also make an implementation of "SCM add" in the Porcelain layer a bit more efficient. A typical workflow without such a convention would consist of: * Create a file and start editing. * "SCM add" file, causing "git-update-cache --add -- file". * Do more changes, and review. * "SCM commit" which does"git-update-cache" changed files, "git-write-tree" and "git-commit-tree" to commit. which wastes one extra blob object per "SCM add". My gut feeling is that more than 80% of the time "SCM add" is followed by some edit to the added file before "SCM commit", unless it is the initial import. If we adopt that convention, "SCM add" would register with --cacheinfo with the magic SHA1 without creating the useless blob, and "SCM commit" will be written to lazily pick things up from the work tree. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:29 ` Nicolas Pitre 2005-05-19 18:46 ` Junio C Hamano @ 2005-05-19 18:47 ` Junio C Hamano 1 sibling, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 18:47 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, git >>>>> "NP" == Nicolas Pitre <nico@cam.org> writes: >> - I have been assuming that diff_delta uses its two input >> read-only but have not verified that myself yet. NP> It does. Thanks (also thanks to Linus for pointing out the PROT_READ in the test program). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 17:13 ` Junio C Hamano 2005-05-19 17:31 ` Linus Torvalds 2005-05-19 18:29 ` Nicolas Pitre @ 2005-05-19 18:42 ` Junio C Hamano 2005-05-19 18:55 ` Linus Torvalds 2 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 18:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Replying to myself. JCH> Oops,... thanks. I still had some doubts about it and that's JCH> why I said it was beta, but that is fine. My doubts are minor: Here is another "doubt" point. I am almost embarrassed to ask this, but what's the right way to express the following? I could not figure out how to silence const warnings from gcc without using the cast there, which defeats the whole point of const warnings: if (!pid) { const char *pgm = external_diff(); if (pgm) { if (one && two) { const char *exec_arg[9]; const char **arg = &exec_arg[0]; *arg++ = pgm; *arg++ = name; *arg++ = temp[0].name; *arg++ = temp[0].hex; *arg++ = temp[0].mode; *arg++ = temp[1].name; *arg++ = temp[1].hex; *arg++ = temp[1].mode; if (other) *arg++ = other; *arg = 0; execvp(pgm, (char *const*) exec_arg); } Here, pgm and name are const, execvp expects char *const argv[] as its second argument. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:42 ` Junio C Hamano @ 2005-05-19 18:55 ` Linus Torvalds 2005-05-19 19:53 ` Junio C Hamano 2005-05-19 20:12 ` H. Peter Anvin 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-19 18:55 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 19 May 2005, Junio C Hamano wrote: > > Here is another "doubt" point. I am almost embarrassed to ask > this, but what's the right way to express the following? I > could not figure out how to silence const warnings from gcc > without using the cast there, which defeats the whole point of > const warnings: You're not doing this right. Like it or not, "execvp()" does not take a pointer to "const char *". It takes a pointer to a constant array of _non-const_ "char *". Which is not what you have. You have a non-const array of "const char *", and as a result, you need the cast. In other words, you really need char *exec_arg[9]; or, if you can indeed set the array to be const, you can make it be char *const exec_arg[9] = { pgm, name, temp[0].name, temp[0].hex, temp[0].mode, temp[1].name, temp[1].hex, temp[1].mode, } which would work, except to avoid warnings it obviously requires that none of the strings themselves are "const" (which isn't true). IOW, you're screwed. "execvp()" really should take an argument of type "const char * const *", but it doesn't for historical reasons. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:55 ` Linus Torvalds @ 2005-05-19 19:53 ` Junio C Hamano 2005-05-19 20:12 ` H. Peter Anvin 1 sibling, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-19 19:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> IOW, you're screwed. "execvp()" really should take an argument of type LT> "const char * const *", but it doesn't for historical reasons. That's what I suspected. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 18:55 ` Linus Torvalds 2005-05-19 19:53 ` Junio C Hamano @ 2005-05-19 20:12 ` H. Peter Anvin 1 sibling, 0 replies; 43+ messages in thread From: H. Peter Anvin @ 2005-05-19 20:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git Linus Torvalds wrote: > > IOW, you're screwed. "execvp()" really should take an argument of type > "const char * const *", but it doesn't for historical reasons. > The real problem, IMNSHO, is that C doesn't allow a pointer to a pointer to a non-const object to be implicitly treated as a pointer to a pointer to a const object. C should have required those two pointer classes to have the same representation (which they would in any sane, and pretty much any insane, system) and therefore a lot of functions could have the additional consts added to their prototypes. At least one can do casts on sane architectures... :-/ -hpa ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 16:19 ` Linus Torvalds 2005-05-19 17:13 ` Junio C Hamano @ 2005-05-19 17:46 ` Joel Becker 2005-05-21 9:37 ` Junio C Hamano 2 siblings, 0 replies; 43+ messages in thread From: Joel Becker @ 2005-05-19 17:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git On Thu, May 19, 2005 at 09:19:28AM -0700, Linus Torvalds wrote: > In other words, let's say that we create a new architecture or a new > filesystem, and we have tons of _new_ files, but not a lot of removed > files. It would literally be very cool to see that the new files are based > on contents of old files, and that it would thus potentially be very > interesting to see a diff like Subversion encourages exactly this with the 'svn cp' command. Just as knowing when a file was renamed allows you to track the history past its first appearance under the current name, 'cp' allows you to follow the history even if the original name still exists. I have found this useful more than once. Now, whether you track this up front with an expensive commit or use tools to discover the relationship at query time (ala your why-rename-tracking-isnt-needed argument) is a different question. As we all know, most tools ask the user to explicitly declare the relationship at the time it happens with 'svn rename' and 'svn cp' or the analog. But git could do the comparisons, with appropriate heuristics, at the time someone asks. Joel -- Life's Little Instruction Book #335 "Every so often, push your luck." Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Detect renames in diff family. 2005-05-19 16:19 ` Linus Torvalds 2005-05-19 17:13 ` Junio C Hamano 2005-05-19 17:46 ` Joel Becker @ 2005-05-21 9:37 ` Junio C Hamano 2005-05-21 9:39 ` [PATCH 1/3] Diff overhaul, adding half of copy detection Junio C Hamano ` (2 more replies) 2 siblings, 3 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 9:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> [ rambling mode on: ] LT> One thing that struck me is that there is nothing wrong with having the LT> same old file marked twice for a rename, or considering new files to be LT> copies of old files. So if we ever allow that, then "rename" may be the LT> wrong name for this, since the logic certainly allows the old file to LT> still exist (or be removed and show up multiple times in a new guise). People say be careful what you wish and for a reason. You may get it ;-). I am sending the following: [PATCH 1/3] Diff overhaul, adding half of copy detection. [PATCH 2/3] Introducing software archaeologist's tool "pickaxe". [PATCH 3/3] Diff overhaul, adding the other half of copy detection. ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/3] Diff overhaul, adding half of copy detection. 2005-05-21 9:37 ` Junio C Hamano @ 2005-05-21 9:39 ` Junio C Hamano 2005-05-21 9:40 ` [PATCH 2/3] Introducing software archaeologist's tool "pickaxe" Junio C Hamano 2005-05-21 9:42 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Junio C Hamano 2 siblings, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 9:39 UTC (permalink / raw) To: Linus Torvalds; +Cc: git This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> --- Documentation/git-diff-cache.txt | 5 Documentation/git-diff-files.txt | 5 Documentation/git-diff-helper.txt | 4 Documentation/git-diff-tree.txt | 6 Makefile | 5 diff-cache.c | 8 diff-files.c | 7 diff-helper.c | 25 + diff-tree.c | 8 diff.c | 703 +++++++++++++++----------------------- diffcore-rename.c | 443 +++++++++++++++++++++++ diffcore.h | 60 +++ git-apply-patch-script | 2 t/t0000-basic.sh | 2 t/t4001-diff-rename.sh | 4 t/t4003-diff-rename-1.sh | 93 +++++ 16 files changed, 946 insertions(+), 434 deletions(-) new file (100644): diffcore-rename.c new file (100644): diffcore.h new file (100755): t/t4003-diff-rename-1.sh diff --git a/Documentation/git-diff-cache.txt b/Documentation/git-diff-cache.txt --- a/Documentation/git-diff-cache.txt +++ b/Documentation/git-diff-cache.txt @@ -9,7 +9,7 @@ git-diff-cache - Compares content and mo SYNOPSIS -------- -'git-diff-cache' [-p] [-r] [-z] [-m] [-M] [-R] [--cached] <tree-ish> +'git-diff-cache' [-p] [-r] [-z] [-m] [-M] [-R] [-C] [--cached] <tree-ish> DESCRIPTION ----------- @@ -36,6 +36,9 @@ OPTIONS -M:: Detect renames; implies -p. +-C:: + Detect copies as well as renames; implies -p. + -R:: Output diff in reverse. diff --git a/Documentation/git-diff-files.txt b/Documentation/git-diff-files.txt --- a/Documentation/git-diff-files.txt +++ b/Documentation/git-diff-files.txt @@ -9,7 +9,7 @@ git-diff-files - Compares files in the w SYNOPSIS -------- -'git-diff-files' [-p] [-q] [-r] [-z] [-M] [-R] [<pattern>...] +'git-diff-files' [-p] [-q] [-r] [-z] [-M] [-C] [-R] [<pattern>...] DESCRIPTION ----------- @@ -32,6 +32,9 @@ OPTIONS -M:: Detect renames; implies -p. +-C:: + Detect copies as well as renames; implies -p. + -r:: This flag does not mean anything. It is there only to match git-diff-tree. Unlike git-diff-tree, git-diff-files always looks diff --git a/Documentation/git-diff-helper.txt b/Documentation/git-diff-helper.txt --- a/Documentation/git-diff-helper.txt +++ b/Documentation/git-diff-helper.txt @@ -9,7 +9,7 @@ git-diff-helper - Generates patch format SYNOPSIS -------- -'git-diff-helper' [-z] [-R] [-M] +'git-diff-helper' [-z] [-R] [-M] [-C] DESCRIPTION ----------- @@ -34,6 +34,8 @@ OPTIONS -M:: Detect renames. +-C:: + Detect copies as well as renames. See Also -------- diff --git a/Documentation/git-diff-tree.txt b/Documentation/git-diff-tree.txt --- a/Documentation/git-diff-tree.txt +++ b/Documentation/git-diff-tree.txt @@ -9,7 +9,7 @@ git-diff-tree - Compares the content and SYNOPSIS -------- -'git-diff-tree' [-p] [-r] [-z] [--stdin] [-M] [-R] [-m] [-s] [-v] <tree-ish> <tree-ish> [<pattern>]\* +'git-diff-tree' [-p] [-r] [-z] [--stdin] [-M] [-R] [-C] [-m] [-s] [-v] <tree-ish> <tree-ish> [<pattern>]\* DESCRIPTION ----------- @@ -36,6 +36,10 @@ OPTIONS -M:: Detect renames; implies -p, in turn implying also '-r'. +-C:: + Detect copies as well as renames; implies -p, in turn + implying also '-r'. + -R:: Output diff in reverse. diff --git a/Makefile b/Makefile --- a/Makefile +++ b/Makefile @@ -45,7 +45,7 @@ LIB_H += strbuf.h LIB_OBJS += strbuf.o LIB_H += diff.h -LIB_OBJS += diff.o +LIB_OBJS += diff.o diffcore-rename.o LIB_OBJS += gitenv.o @@ -121,9 +121,10 @@ object.o: $(LIB_H) read-cache.o: $(LIB_H) sha1_file.o: $(LIB_H) usage.o: $(LIB_H) -diff.o: $(LIB_H) strbuf.o: $(LIB_H) gitenv.o: $(LIB_H) +diff.o: $(LIB_H) +diffcore-rename.o : $(LIB_H) test: all make -C t/ all diff --git a/diff-cache.c b/diff-cache.c --- a/diff-cache.c +++ b/diff-cache.c @@ -153,7 +153,7 @@ static void mark_merge_entries(void) } static char *diff_cache_usage = -"git-diff-cache [-p] [-r] [-z] [-m] [-M] [-R] [--cached] <tree-ish>"; +"git-diff-cache [-p] [-r] [-z] [-m] [-M] [-C] [-R] [--cached] <tree-ish>"; int main(int argc, char **argv) { @@ -180,6 +180,12 @@ int main(int argc, char **argv) diff_score_opt = diff_scoreopt_parse(arg); continue; } + if (!strncmp(arg, "-C", 2)) { + generate_patch = 1; + detect_rename = 2; + diff_score_opt = diff_scoreopt_parse(arg); + continue; + } if (!strcmp(arg, "-z")) { line_termination = '\0'; continue; diff --git a/diff-files.c b/diff-files.c --- a/diff-files.c +++ b/diff-files.c @@ -7,7 +7,7 @@ #include "diff.h" static const char *diff_files_usage = -"git-diff-files [-p] [-q] [-r] [-z] [-M] [-R] [paths...]"; +"git-diff-files [-p] [-q] [-r] [-z] [-M] [-C] [-R] [paths...]"; static int generate_patch = 0; static int line_termination = '\n'; @@ -71,6 +71,11 @@ int main(int argc, char **argv) diff_score_opt = diff_scoreopt_parse(argv[1]); detect_rename = generate_patch = 1; } + else if (!strncmp(argv[1], "-C", 2)) { + diff_score_opt = diff_scoreopt_parse(argv[1]); + detect_rename = 2; + generate_patch = 1; + } else usage(diff_files_usage); argv++; argc--; diff --git a/diff-helper.c b/diff-helper.c --- a/diff-helper.c +++ b/diff-helper.c @@ -8,6 +8,7 @@ static int detect_rename = 0; static int diff_score_opt = 0; +static int generate_patch = 1; static int parse_oneside_change(const char *cp, int *mode, unsigned char *sha1, char *path) @@ -20,7 +21,8 @@ static int parse_oneside_change(const ch cp++; } *mode = m; - if (strncmp(cp, "\tblob\t", 6) && strncmp(cp, " blob ", 6)) + if (strncmp(cp, "\tblob\t", 6) && strncmp(cp, " blob ", 6) && + strncmp(cp, "\ttree\t", 6) && strncmp(cp, " tree ", 6)) return -1; cp += 6; if (get_sha1_hex(cp, sha1)) @@ -44,11 +46,13 @@ static int parse_diff_raw_output(const c diff_unmerge(cp + 1); break; case '+': - parse_oneside_change(cp, &new_mode, new_sha1, path); + if (parse_oneside_change(cp, &new_mode, new_sha1, path)) + return -1; diff_addremove('+', new_mode, new_sha1, path, NULL); break; case '-': - parse_oneside_change(cp, &old_mode, old_sha1, path); + if (parse_oneside_change(cp, &old_mode, old_sha1, path)) + return -1; diff_addremove('-', old_mode, old_sha1, path, NULL); break; case '*': @@ -64,7 +68,8 @@ static int parse_diff_raw_output(const c new_mode = (new_mode << 3) | (ch - '0'); cp++; } - if (strncmp(cp, "\tblob\t", 6) && strncmp(cp, " blob ", 6)) + if (strncmp(cp, "\tblob\t", 6) && strncmp(cp, " blob ", 6) && + strncmp(cp, "\ttree\t", 6) && strncmp(cp, " tree ", 6)) return -1; cp += 6; if (get_sha1_hex(cp, old_sha1)) @@ -88,7 +93,7 @@ static int parse_diff_raw_output(const c } static const char *diff_helper_usage = - "git-diff-helper [-z] [-R] [-M] paths..."; + "git-diff-helper [-z] [-R] [-M] [-C] paths..."; int main(int ac, const char **av) { struct strbuf sb; @@ -102,17 +107,25 @@ int main(int ac, const char **av) { reverse = 1; else if (av[1][1] == 'z') line_termination = 0; + else if (av[1][1] == 'p') /* hidden from the help */ + generate_patch = 0; else if (av[1][1] == 'M') { detect_rename = 1; diff_score_opt = diff_scoreopt_parse(av[1]); } + else if (av[1][1] == 'C') { + detect_rename = 2; + diff_score_opt = diff_scoreopt_parse(av[1]); + } else usage(diff_helper_usage); ac--; av++; } /* the remaining parameters are paths patterns */ - diff_setup(detect_rename, diff_score_opt, reverse, -1, av+1, ac-1); + diff_setup(detect_rename, diff_score_opt, reverse, + (generate_patch ? -1 : line_termination), + av+1, ac-1); while (1) { int status; diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -430,7 +430,7 @@ static int diff_tree_stdin(char *line) } static char *diff_tree_usage = -"git-diff-tree [-p] [-r] [-z] [--stdin] [-M] [-R] [-m] [-s] [-v] <tree-ish> <tree-ish>"; +"git-diff-tree [-p] [-r] [-z] [--stdin] [-M] [-C] [-R] [-m] [-s] [-v] <tree-ish> <tree-ish>"; int main(int argc, char **argv) { @@ -478,6 +478,12 @@ int main(int argc, char **argv) diff_score_opt = diff_scoreopt_parse(arg); continue; } + if (!strncmp(arg, "-C", 2)) { + detect_rename = 2; + recursive = generate_patch = 1; + diff_score_opt = diff_scoreopt_parse(arg); + continue; + } if (!strcmp(arg, "-z")) { line_termination = '\0'; continue; diff --git a/diff.c b/diff.c --- a/diff.c +++ b/diff.c @@ -7,12 +7,17 @@ #include <limits.h> #include "cache.h" #include "diff.h" -#include "delta.h" +#include "diffcore.h" static const char *diff_opts = "-pu"; static unsigned char null_sha1[20] = { 0, }; -#define MAX_SCORE 10000 -#define DEFAULT_MINIMUM_SCORE 5000 + +static int detect_rename; +static int reverse_diff; +static int diff_raw_output = -1; +static const char **pathspec; +static int speccnt; +static int minimum_score; static const char *external_diff(void) { @@ -77,26 +82,16 @@ static char *sq_expand(const char *src) } static struct diff_tempfile { - const char *name; + const char *name; /* filename external diff should read from */ char hex[41]; char mode[10]; char tmp_path[50]; } diff_temp[2]; -struct diff_spec { - unsigned char blob_sha1[20]; - unsigned short mode; /* file mode */ - unsigned sha1_valid : 1; /* if true, use blob_sha1 and trust mode; - * if false, use the name and read from - * the filesystem. - */ - unsigned file_valid : 1; /* if false the file does not exist */ -}; - static void builtin_diff(const char *name_a, const char *name_b, struct diff_tempfile *temp, - int rename_score) + const char *xfrm_msg) { int i, next_at, cmd_size; const char *diff_cmd = "diff -L'%s%s' -L'%s%s'"; @@ -151,14 +146,9 @@ static void builtin_diff(const char *nam printf("old mode %s\n", temp[0].mode); printf("new mode %s\n", temp[1].mode); } - if (strcmp(name_a, name_b)) { - if (0 < rename_score) - printf("rename similarity index %d%%\n", - (int)(0.5+ - rename_score*100.0/MAX_SCORE)); - printf("rename old %s\n", name_a); - printf("rename new %s\n", name_b); - } + if (xfrm_msg && xfrm_msg[0]) + fputs(xfrm_msg, stdout); + if (strncmp(temp[0].mode, temp[1].mode, 3)) /* we do not run diff between different kind * of objects. @@ -169,6 +159,28 @@ static void builtin_diff(const char *nam execlp("/bin/sh","sh", "-c", cmd, NULL); } +struct diff_filespec *alloc_filespec(const char *path) +{ + int namelen = strlen(path); + struct diff_filespec *spec = xmalloc(sizeof(*spec) + namelen + 1); + spec->path = (char *)(spec + 1); + strcpy(spec->path, path); + spec->should_free = spec->should_munmap = spec->file_valid = 0; + spec->xfrm_flags = 0; + spec->size = 0; + spec->data = 0; + return spec; +} + +void fill_filespec(struct diff_filespec *spec, const unsigned char *sha1, + unsigned short mode) +{ + spec->mode = mode; + memcpy(spec->sha1, sha1, 20); + spec->sha1_valid = !!memcmp(sha1, null_sha1, 20); + spec->file_valid = 1; +} + /* * Given a name and sha1 pair, if the dircache tells us the file in * the work tree has that object contents, return true, so that @@ -201,13 +213,86 @@ static int work_tree_matches(const char return 0; ce = active_cache[pos]; if ((lstat(name, &st) < 0) || - !S_ISREG(st.st_mode) || + !S_ISREG(st.st_mode) || /* careful! */ ce_match_stat(ce, &st) || memcmp(sha1, ce->sha1, 20)) return 0; + /* we return 1 only when we can stat, it is a regular file, + * stat information matches, and sha1 recorded in the cache + * matches. I.e. we know the file in the work tree really is + * the same as the <name, sha1> pair. + */ return 1; } +/* + * While doing rename detection and pickaxe operation, we may need to + * grab the data for the blob (or file) for our own in-core comparison. + * diff_filespec has data and size fields for this purpose. + */ +int diff_populate_filespec(struct diff_filespec *s) +{ + int err = 0; + if (!s->file_valid) + die("internal error: asking to populate invalid file."); + if (S_ISDIR(s->mode)) + return -1; + + if (s->data) + return err; + if (!s->sha1_valid || + work_tree_matches(s->path, s->sha1)) { + struct stat st; + int fd; + if (lstat(s->path, &st) < 0) { + if (errno == ENOENT) { + err_empty: + err = -1; + empty: + s->data = ""; + s->size = 0; + return err; + } + } + s->size = st.st_size; + if (!s->size) + goto empty; + if (S_ISLNK(st.st_mode)) { + int ret; + s->data = xmalloc(s->size); + s->should_free = 1; + ret = readlink(s->path, s->data, s->size); + if (ret < 0) { + free(s->data); + goto err_empty; + } + return 0; + } + fd = open(s->path, O_RDONLY); + if (fd < 0) + goto err_empty; + s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0); + s->should_munmap = 1; + close(fd); + } + else { + char type[20]; + s->data = read_sha1_file(s->sha1, type, &s->size); + s->should_free = 1; + } + return 0; +} + +void diff_free_filespec_data(struct diff_filespec *s) +{ + if (s->should_free) + free(s->data); + else if (s->should_munmap) + munmap(s->data, s->size); + s->should_free = s->should_munmap = 0; + s->data = 0; +} + static void prep_temp_blob(struct diff_tempfile *temp, void *blob, unsigned long size, @@ -231,7 +316,7 @@ static void prep_temp_blob(struct diff_t static void prepare_temp_file(const char *name, struct diff_tempfile *temp, - struct diff_spec *one) + struct diff_filespec *one) { if (!one->file_valid) { not_a_valid_file: @@ -245,13 +330,12 @@ static void prepare_temp_file(const char } if (!one->sha1_valid || - work_tree_matches(name, one->blob_sha1)) { + work_tree_matches(name, one->sha1)) { struct stat st; - temp->name = name; - if (lstat(temp->name, &st) < 0) { + if (lstat(name, &st) < 0) { if (errno == ENOENT) goto not_a_valid_file; - die("stat(%s): %s", temp->name, strerror(errno)); + die("stat(%s): %s", name, strerror(errno)); } if (S_ISLNK(st.st_mode)) { int ret; @@ -263,31 +347,27 @@ static void prepare_temp_file(const char die("readlink(%s)", name); prep_temp_blob(temp, buf, st.st_size, (one->sha1_valid ? - one->blob_sha1 : null_sha1), + one->sha1 : null_sha1), (one->sha1_valid ? one->mode : S_IFLNK)); } else { + /* we can borrow from the file in the work tree */ + temp->name = name; if (!one->sha1_valid) strcpy(temp->hex, sha1_to_hex(null_sha1)); else - strcpy(temp->hex, sha1_to_hex(one->blob_sha1)); + strcpy(temp->hex, sha1_to_hex(one->sha1)); sprintf(temp->mode, "%06o", S_IFREG |ce_permissions(st.st_mode)); } return; } else { - void *blob; - char type[20]; - unsigned long size; - - blob = read_sha1_file(one->blob_sha1, type, &size); - if (!blob || strcmp(type, "blob")) - die("unable to read blob object for %s (%s)", - name, sha1_to_hex(one->blob_sha1)); - prep_temp_blob(temp, blob, size, one->blob_sha1, one->mode); - free(blob); + if (diff_populate_filespec(one)) + die("cannot read data blob for %s", one->path); + prep_temp_blob(temp, one->data, one->size, + one->sha1, one->mode); } } @@ -307,13 +387,6 @@ static void remove_tempfile_on_signal(in remove_tempfile(); } -static int detect_rename; -static int reverse_diff; -static int diff_raw_output = -1; -static const char **pathspec; -static int speccnt; -static int minimum_score; - static int matches_pathspec(const char *name) { int i; @@ -341,9 +414,9 @@ static int matches_pathspec(const char * */ static void run_external_diff(const char *name, const char *other, - struct diff_spec *one, - struct diff_spec *two, - int rename_score) + struct diff_filespec *one, + struct diff_filespec *two, + const char *xfrm_msg) { struct diff_tempfile *temp = diff_temp; pid_t pid; @@ -373,7 +446,7 @@ static void run_external_diff(const char const char *pgm = external_diff(); if (pgm) { if (one && two) { - const char *exec_arg[9]; + const char *exec_arg[10]; const char **arg = &exec_arg[0]; *arg++ = pgm; *arg++ = name; @@ -383,9 +456,11 @@ static void run_external_diff(const char *arg++ = temp[1].name; *arg++ = temp[1].hex; *arg++ = temp[1].mode; - if (other) + if (other) { *arg++ = other; - *arg = NULL; + *arg++ = xfrm_msg; + } + *arg = 0; execvp(pgm, (char *const*) exec_arg); } else @@ -395,7 +470,7 @@ static void run_external_diff(const char * otherwise we use the built-in one. */ if (one && two) - builtin_diff(name, other ? : name, temp, rename_score); + builtin_diff(name, other ? : name, temp, xfrm_msg); else printf("* Unmerged path %s\n", name); exit(0); @@ -418,335 +493,166 @@ static void run_external_diff(const char remove_tempfile(); } -/* - * We do not detect circular renames. Just hold created and deleted - * entries and later attempt to match them up. If they do not match, - * then spit them out as deletes or creates as original. - */ - -static struct diff_spec_hold { - struct diff_spec_hold *next; - struct diff_spec it; - unsigned long size; - int flags; -#define MATCHED 1 -#define SHOULD_FREE 2 -#define SHOULD_MUNMAP 4 - void *data; - char path[1]; -} *createdfile, *deletedfile; - -static void hold_diff(const char *name, - struct diff_spec *one, - struct diff_spec *two) -{ - struct diff_spec_hold **list, *elem; - - if (one->file_valid && two->file_valid) - die("internal error"); - - if (!detect_rename) { - run_external_diff(name, NULL, one, two, -1); - return; - } - elem = xmalloc(sizeof(*elem) + strlen(name)); - strcpy(elem->path, name); - elem->size = 0; - elem->data = NULL; - elem->flags = 0; - if (one->file_valid) { - list = &deletedfile; - elem->it = *one; - } - else { - list = &createdfile; - elem->it = *two; - } - elem->next = *list; - *list = elem; -} - -static int populate_data(struct diff_spec_hold *s) +int diff_scoreopt_parse(const char *opt) { - char type[20]; + int diglen, num, scale, i; + if (opt[0] != '-' || (opt[1] != 'M' && opt[1] != 'C')) + return -1; /* that is not a -M nor -C option */ + diglen = strspn(opt+2, "0123456789"); + if (diglen == 0 || strlen(opt+2) != diglen) + return 0; /* use default */ + sscanf(opt+2, "%d", &num); + for (i = 0, scale = 1; i < diglen; i++) + scale *= 10; - if (s->data) - return 0; - if (s->it.sha1_valid) { - s->data = read_sha1_file(s->it.blob_sha1, type, &s->size); - s->flags |= SHOULD_FREE; - } - else { - struct stat st; - int fd; - fd = open(s->path, O_RDONLY); - if (fd < 0) - return -1; - if (fstat(fd, &st)) { - close(fd); - return -1; - } - s->size = st.st_size; - s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0); - close(fd); - if (!s->size) - s->data = ""; - else - s->flags |= SHOULD_MUNMAP; - } - return 0; + /* user says num divided by scale and we say internally that + * is MAX_SCORE * num / scale. + */ + return MAX_SCORE * num / scale; } -static void free_data(struct diff_spec_hold *s) +void diff_setup(int detect_rename_, int minimum_score_, int reverse_diff_, + int diff_raw_output_, + const char **pathspec_, int speccnt_) { - if (s->flags & SHOULD_FREE) - free(s->data); - else if (s->flags & SHOULD_MUNMAP) - munmap(s->data, s->size); - s->flags &= ~(SHOULD_FREE|SHOULD_MUNMAP); - s->data = NULL; + detect_rename = detect_rename_; + reverse_diff = reverse_diff_; + pathspec = pathspec_; + diff_raw_output = diff_raw_output_; + speccnt = speccnt_; + minimum_score = minimum_score_ ? : DEFAULT_MINIMUM_SCORE; } -static void flush_remaining_diff(struct diff_spec_hold *elem, - int on_created_list) -{ - static struct diff_spec null_file_spec; +static struct diff_queue_struct queued_diff; - null_file_spec.file_valid = 0; - for ( ; elem ; elem = elem->next) { - free_data(elem); - if (elem->flags & MATCHED) - continue; - if (on_created_list) - run_external_diff(elem->path, NULL, - &null_file_spec, &elem->it, -1); - else - run_external_diff(elem->path, NULL, - &elem->it, &null_file_spec, -1); +struct diff_file_pair *diff_queue(struct diff_queue_struct *queue, + struct diff_filespec *one, + struct diff_filespec *two) +{ + struct diff_file_pair *dp = xmalloc(sizeof(*dp)); + dp->one = one; + dp->two = two; + dp->xfrm_msg = 0; + dp->orig_order = queue->nr; + dp->xfrm_work = 0; + if (queue->alloc <= queue->nr) { + queue->alloc = alloc_nr(queue->alloc); + queue->queue = xrealloc(queue->queue, + sizeof(dp) * queue->alloc); } + queue->queue[queue->nr++] = dp; + return dp; } -static int is_exact_match(struct diff_spec_hold *src, - struct diff_spec_hold *dst) +static const char *git_object_type(unsigned mode) { - if (src->it.sha1_valid && dst->it.sha1_valid && - !memcmp(src->it.blob_sha1, dst->it.blob_sha1, 20)) - return 1; - if (populate_data(src) || populate_data(dst)) - /* this is an error but will be caught downstream */ - return 0; - if (src->size == dst->size && - !memcmp(src->data, dst->data, src->size)) - return 1; - return 0; + return S_ISDIR(mode) ? "tree" : "blob"; } -static int estimate_similarity(struct diff_spec_hold *src, struct diff_spec_hold *dst) +static void diff_flush_raw(struct diff_file_pair *p) { - /* src points at a deleted file and dst points at a created - * file. They may be quite similar, in which case we want to - * say src is renamed to dst. - * - * Compare them and return how similar they are, representing - * the score as an integer between 0 and 10000, except - * where they match exactly it is considered better than anything - * else. - */ - void *delta; - unsigned long delta_size; - int score; - - delta_size = ((src->size < dst->size) ? - (dst->size - src->size) : (src->size - dst->size)); - - /* We would not consider rename followed by more than - * minimum_score/MAX_SCORE edits; that is, delta_size must be smaller - * than (src->size + dst->size)/2 * minimum_score/MAX_SCORE, - * which means... - */ - - if ((src->size+dst->size)*minimum_score < delta_size*MAX_SCORE*2) - return 0; + struct diff_filespec *it; + int addremove; - delta = diff_delta(src->data, src->size, - dst->data, dst->size, - &delta_size); - free(delta); + /* raw output does not have a way to express rename nor copy */ + if (strcmp(p->one->path, p->two->path)) + return; - /* This "delta" is really xdiff with adler32 and all the - * overheads but it is a quick and dirty approximation. - * - * Now we will give some score to it. 100% edit gets - * 0 points and 0% edit gets MAX_SCORE points. That is, every - * 1/MAX_SCORE edit gets 1 point penalty. The amount of penalty is: - * - * (delta_size * 2 / (src->size + dst->size)) * MAX_SCORE - * - */ - score = MAX_SCORE-(MAX_SCORE*2*delta_size/(src->size+dst->size)); - if (score < 0) return 0; - if (MAX_SCORE < score) return MAX_SCORE; - return score; -} + if (p->one->file_valid && p->two->file_valid) { + char hex[41]; + strcpy(hex, sha1_to_hex(p->one->sha1)); + printf("*%06o->%06o %s %s->%s %s%c", + p->one->mode, p->two->mode, + git_object_type(p->one->mode), + hex, sha1_to_hex(p->two->sha1), + p->one->path, diff_raw_output); + return; + } -struct diff_score { - struct diff_spec_hold *src; - struct diff_spec_hold *dst; - int score; -}; + if (p->one->file_valid) { + it = p->one; + addremove = '-'; + } else { + it = p->two; + addremove = '+'; + } -static int score_compare(const void *a_, const void *b_) -{ - const struct diff_score *a = a_, *b = b_; - return b->score - a->score; + printf("%c%06o %s %s %s%c", + addremove, + it->mode, git_object_type(it->mode), + sha1_to_hex(it->sha1), it->path, diff_raw_output); } -static void flush_rename_pair(struct diff_spec_hold *src, - struct diff_spec_hold *dst, - int rename_score) +static void diff_flush_patch(struct diff_file_pair *p) { - src->flags |= MATCHED; - dst->flags |= MATCHED; - free_data(src); - free_data(dst); - run_external_diff(src->path, dst->path, - &src->it, &dst->it, rename_score); -} + const char *name, *other; -static void free_held_diff(struct diff_spec_hold *list) -{ - struct diff_spec_hold *h; - for (h = list; list; list = h) { - h = list->next; - free_data(list); - free(list); - } + name = p->one->path; + other = (strcmp(name, p->two->path) ? p->two->path : NULL); + if ((p->one->file_valid && S_ISDIR(p->one->mode)) || + (p->two->file_valid && S_ISDIR(p->two->mode))) + return; /* no tree diffs in patch format */ + + run_external_diff(name, other, p->one, p->two, p->xfrm_msg); } -void diff_flush(void) +static int identical(struct diff_filespec *one, struct diff_filespec *two) { - int num_create, num_delete, c, d; - struct diff_spec_hold *elem, *src, *dst; - struct diff_score *mx; - - /* We really want to cull the candidates list early - * with cheap tests in order to avoid doing deltas. - * - * With the current callers, we should not have already - * matched entries at this point, but it is nonetheless - * checked for sanity. + /* This function is written stricter than necessary to support + * the currently implemented transformers, but the idea is to + * let transformers to produce diff_file_pairs any way they want, + * and filter and clean them up here before producing the output. */ - for (dst = createdfile; dst; dst = dst->next) { - if (dst->flags & MATCHED) - continue; - for (src = deletedfile; src; src = src->next) { - if (src->flags & MATCHED) - continue; - if (! is_exact_match(src, dst)) - continue; - flush_rename_pair(src, dst, MAX_SCORE); - break; - } - } - - /* Count surviving candidates */ - for (num_create = 0, elem = createdfile; elem; elem = elem->next) - if (!(elem->flags & MATCHED)) - num_create++; - - for (num_delete = 0, elem = deletedfile; elem; elem = elem->next) - if (!(elem->flags & MATCHED)) - num_delete++; - - if (num_create == 0 || num_delete == 0) - goto exit_path; - - mx = xmalloc(sizeof(*mx) * num_create * num_delete); - for (c = 0, dst = createdfile; dst; dst = dst->next) { - int base = c * num_delete; - if (dst->flags & MATCHED) - continue; - for (d = 0, src = deletedfile; src; src = src->next) { - struct diff_score *m = &mx[base+d]; - if (src->flags & MATCHED) - continue; - m->src = src; - m->dst = dst; - m->score = estimate_similarity(src, dst); - d++; - } - c++; - } - qsort(mx, num_create*num_delete, sizeof(*mx), score_compare); - -#if 0 - for (c = 0; c < num_create * num_delete; c++) { - src = mx[c].src; - dst = mx[c].dst; - if ((src->flags & MATCHED) || (dst->flags & MATCHED)) - continue; - fprintf(stderr, - "**score ** %d %s %s\n", - mx[c].score, src->path, dst->path); - } -#endif - - for (c = 0; c < num_create * num_delete; c++) { - src = mx[c].src; - dst = mx[c].dst; - if ((src->flags & MATCHED) || (dst->flags & MATCHED)) - continue; - if (mx[c].score < minimum_score) - break; - flush_rename_pair(src, dst, mx[c].score); - } - free(mx); - - exit_path: - flush_remaining_diff(createdfile, 1); - flush_remaining_diff(deletedfile, 0); - free_held_diff(createdfile); - free_held_diff(deletedfile); - createdfile = deletedfile = NULL; -} -int diff_scoreopt_parse(const char *opt) -{ - int diglen, num, scale, i; - if (opt[0] != '-' || opt[1] != 'M') - return -1; /* that is not -M option */ - diglen = strspn(opt+2, "0123456789"); - if (diglen == 0 || strlen(opt+2) != diglen) - return 0; /* use default */ - sscanf(opt+2, "%d", &num); - for (i = 0, scale = 1; i < diglen; i++) - scale *= 10; + if (!one->file_valid && !two->file_valid) + return 1; /* not interesting */ - /* user says num divided by scale and we say internally that - * is MAX_SCORE * num / scale. + /* deletion, addition, mode change and renames are all interesting. */ + if ((one->file_valid != two->file_valid) || (one->mode != two->mode) || + strcmp(one->path, two->path)) + return 0; + + /* both are valid and point at the same path. that is, we are + * dealing with a change. */ - return MAX_SCORE * num / scale; + if (one->sha1_valid && two->sha1_valid && + !memcmp(one->sha1, two->sha1, sizeof(one->sha1))) + return 1; /* no change */ + if (!one->sha1_valid && !two->sha1_valid) + return 1; /* both look at the same file on the filesystem. */ + return 0; } -void diff_setup(int detect_rename_, int minimum_score_, int reverse_diff_, - int diff_raw_output_, - const char **pathspec_, int speccnt_) +static void diff_flush_one(struct diff_file_pair *p) { - free_held_diff(createdfile); - free_held_diff(deletedfile); - createdfile = deletedfile = NULL; - - detect_rename = detect_rename_; - reverse_diff = reverse_diff_; - pathspec = pathspec_; - diff_raw_output = diff_raw_output_; - speccnt = speccnt_; - minimum_score = minimum_score_ ? : DEFAULT_MINIMUM_SCORE; + if (identical(p->one, p->two)) + return; + if (0 <= diff_raw_output) + diff_flush_raw(p); + else + diff_flush_patch(p); } -static const char *git_object_type(unsigned mode) +void diff_flush(void) { - return S_ISDIR(mode) ? "tree" : "blob"; + struct diff_queue_struct *q = &queued_diff; + int i; + + if (detect_rename) + diff_detect_rename(q, detect_rename, minimum_score); + for (i = 0; i < q->nr; i++) + diff_flush_one(q->queue[i]); + + for (i = 0; i < q->nr; i++) { + struct diff_file_pair *p = q->queue[i]; + diff_free_filespec_data(p->one); + diff_free_filespec_data(p->two); + free(p->xfrm_msg); + free(p); + } + free(q->queue); + q->queue = NULL; + q->nr = q->alloc = 0; } void diff_addremove(int addremove, unsigned mode, @@ -754,41 +660,35 @@ void diff_addremove(int addremove, unsig const char *base, const char *path) { char concatpath[PATH_MAX]; - struct diff_spec spec[2], *one, *two; + struct diff_filespec *one, *two; + /* This may look odd, but it is a preparation for + * feeding "there are unchanged files which should + * not produce diffs, but when you are doing copy + * detection you would need them, so here they are" + * entries to the diff-core. They will be prefixed + * with something like '=' or '*' (I haven't decided + * which but should not make any difference). + * Feeding the same new and old to diff_change() should + * also have the same effect. diff_flush() should + * filter the identical ones out at the final output + * stage. + */ if (reverse_diff) - addremove = (addremove == '+' ? '-' : '+'); - - if (0 <= diff_raw_output) { - if (!path) - path = ""; - printf("%c%06o %s %s %s%s%c", - addremove, - mode, - git_object_type(mode), sha1_to_hex(sha1), - base, path, diff_raw_output); - return; - } - if (S_ISDIR(mode)) - return; + addremove = (addremove == '+' ? '-' : + addremove == '-' ? '+' : addremove); - memcpy(spec[0].blob_sha1, sha1, 20); - spec[0].mode = mode; - spec[0].sha1_valid = !!memcmp(sha1, null_sha1, 20); - spec[0].file_valid = 1; - spec[1].file_valid = 0; - - if (addremove == '+') { - one = spec + 1; two = spec; - } else { - one = spec; two = one + 1; - } + if (!path) path = ""; + sprintf(concatpath, "%s%s", base, path); + one = alloc_filespec(concatpath); + two = alloc_filespec(concatpath); + + if (addremove != '+') + fill_filespec(one, sha1, mode); + if (addremove != '-') + fill_filespec(two, sha1, mode); - if (path) { - strcpy(concatpath, base); - strcat(concatpath, path); - } - hold_diff(path ? concatpath : base, one, two); + diff_queue(&queued_diff, one, two); } void diff_change(unsigned old_mode, unsigned new_mode, @@ -796,7 +696,7 @@ void diff_change(unsigned old_mode, unsi const unsigned char *new_sha1, const char *base, const char *path) { char concatpath[PATH_MAX]; - struct diff_spec spec[2]; + struct diff_filespec *one, *two; if (reverse_diff) { unsigned tmp; @@ -804,41 +704,14 @@ void diff_change(unsigned old_mode, unsi tmp = old_mode; old_mode = new_mode; new_mode = tmp; tmp_c = old_sha1; old_sha1 = new_sha1; new_sha1 = tmp_c; } + if (!path) path = ""; + sprintf(concatpath, "%s%s", base, path); + one = alloc_filespec(concatpath); + two = alloc_filespec(concatpath); + fill_filespec(one, old_sha1, old_mode); + fill_filespec(two, new_sha1, new_mode); - if (0 <= diff_raw_output) { - char old_hex[41]; - strcpy(old_hex, sha1_to_hex(old_sha1)); - - if (!path) - path = ""; - printf("*%06o->%06o %s %s->%s %s%s%c", - old_mode, new_mode, - git_object_type(new_mode), - old_hex, sha1_to_hex(new_sha1), - base, path, diff_raw_output); - return; - } - if (S_ISDIR(new_mode)) - return; - - if (path) { - strcpy(concatpath, base); - strcat(concatpath, path); - } - - memcpy(spec[0].blob_sha1, old_sha1, 20); - spec[0].mode = old_mode; - memcpy(spec[1].blob_sha1, new_sha1, 20); - spec[1].mode = new_mode; - spec[0].sha1_valid = !!memcmp(old_sha1, null_sha1, 20); - spec[1].sha1_valid = !!memcmp(new_sha1, null_sha1, 20); - spec[1].file_valid = spec[0].file_valid = 1; - - /* We do not look at changed files as candidate for - * rename detection ever. - */ - run_external_diff(path ? concatpath : base, NULL, - &spec[0], &spec[1], -1); + diff_queue(&queued_diff, one, two); } void diff_unmerge(const char *path) @@ -847,5 +720,5 @@ void diff_unmerge(const char *path) printf("U %s%c", path, diff_raw_output); return; } - run_external_diff(path, NULL, NULL, NULL, -1); + run_external_diff(path, NULL, NULL, NULL, NULL); } diff --git a/diffcore-rename.c b/diffcore-rename.c new file mode 100644 --- /dev/null +++ b/diffcore-rename.c @@ -0,0 +1,443 @@ +/* + * Copyright (C) 2005 Junio C Hamano + */ +#include "cache.h" +#include "diff.h" +#include "diffcore.h" +#include "delta.h" + +struct diff_rename_pool { + struct diff_filespec **s; + int nr, alloc; +}; + +static void diff_rename_pool_clear(struct diff_rename_pool *pool) +{ + pool->s = NULL; pool->nr = pool->alloc = 0; +} + +static void diff_rename_pool_add(struct diff_rename_pool *pool, + struct diff_filespec *s) +{ + if (S_ISDIR(s->mode)) + return; /* rename/copy patch for tree does not make sense. */ + + if (pool->alloc <= pool->nr) { + pool->alloc = alloc_nr(pool->alloc); + pool->s = xrealloc(pool->s, + sizeof(*(pool->s)) * pool->alloc); + } + pool->s[pool->nr] = s; + pool->nr++; +} + +static int is_exact_match(struct diff_filespec *src, struct diff_filespec *dst) +{ + if (src->sha1_valid && dst->sha1_valid && + !memcmp(src->sha1, dst->sha1, 20)) + return 1; + if (diff_populate_filespec(src) || diff_populate_filespec(dst)) + /* this is an error but will be caught downstream */ + return 0; + if (src->size == dst->size && + !memcmp(src->data, dst->data, src->size)) + return 1; + return 0; +} + +struct diff_score { + struct diff_filespec *src; + struct diff_filespec *dst; + int score; + int rank; +}; + +static int estimate_similarity(struct diff_filespec *src, + struct diff_filespec *dst, + int minimum_score) +{ + /* src points at a file that existed in the original tree (or + * optionally a file in the destination tree) and dst points + * at a newly created file. They may be quite similar, in which + * case we want to say src is renamed to dst or src is copied into + * dst, and then some edit has been applied to dst. + * + * Compare them and return how similar they are, representing + * the score as an integer between 0 and 10000, except + * where they match exactly it is considered better than anything + * else. + */ + void *delta; + unsigned long delta_size; + int score; + + delta_size = ((src->size < dst->size) ? + (dst->size - src->size) : (src->size - dst->size)); + + /* We would not consider rename followed by more than + * minimum_score/MAX_SCORE edits; that is, delta_size must be smaller + * than (src->size + dst->size)/2 * minimum_score/MAX_SCORE, + * which means... + */ + + if ((src->size+dst->size)*minimum_score < delta_size*MAX_SCORE*2) + return 0; + + delta = diff_delta(src->data, src->size, + dst->data, dst->size, + &delta_size); + free(delta); + + /* This "delta" is really xdiff with adler32 and all the + * overheads but it is a quick and dirty approximation. + * + * Now we will give some score to it. 100% edit gets + * 0 points and 0% edit gets MAX_SCORE points. That is, every + * 1/MAX_SCORE edit gets 1 point penalty. The amount of penalty is: + * + * (delta_size * 2 / (src->size + dst->size)) * MAX_SCORE + * + */ + score = MAX_SCORE-(MAX_SCORE*2*delta_size/(src->size+dst->size)); + if (score < 0) return 0; + if (MAX_SCORE < score) return MAX_SCORE; + return score; +} + +static void record_rename_pair(struct diff_queue_struct *outq, + struct diff_filespec *src, + struct diff_filespec *dst, + int rank, + int score) +{ + /* The rank is used to sort the final output, because there + * are certain dependencies. + * + * - rank #0 depends on deleted ones. + * - rank #1 depends on kept files before they are modified. + * - rank #2 depends on kept files after they are modified; + * currently not used. + * + * Therefore, the final output order should be: + * + * 1. rank #0 rename/copy diffs. + * 2. deletions in the original. + * 3. rank #1 rename/copy diffs. + * 4. additions and modifications in the original. + * 5. rank #2 rename/copy diffs; currently not used. + * + * To achieve this sort order, we give xform_work the number + * above. + */ + struct diff_file_pair *dp = diff_queue(outq, src, dst); + dp->xfrm_work = (rank * 2 + 1) | (score<<RENAME_SCORE_SHIFT); + dst->xfrm_flags |= RENAME_DST_MATCHED; +} + +#if 0 +static void debug_filespec(struct diff_filespec *s, int x, const char *one) +{ + fprintf(stderr, "queue[%d] %s (%s) %s %06o %s\n", + x, one, + s->path, + s->file_valid ? "valid" : "invalid", + s->mode, + s->sha1_valid ? sha1_to_hex(s->sha1) : ""); + fprintf(stderr, "queue[%d] %s size %lu flags %d\n", + x, one, + s->size, s->xfrm_flags); +} + +static void debug_filepair(const struct diff_file_pair *p, int i) +{ + debug_filespec(p->one, i, "one"); + debug_filespec(p->two, i, "two"); + fprintf(stderr, "pair flags %d, orig order %d, score %d\n", + (p->xfrm_work & ((1<<RENAME_SCORE_SHIFT) - 1)), + p->orig_order, + (p->xfrm_work >> RENAME_SCORE_SHIFT)); +} + +static void debug_queue(const char *msg, struct diff_queue_struct *q) +{ + int i; + if (msg) + fprintf(stderr, "%s\n", msg); + fprintf(stderr, "q->nr = %d\n", q->nr); + for (i = 0; i < q->nr; i++) { + struct diff_file_pair *p = q->queue[i]; + debug_filepair(p, i); + } +} +#else +#define debug_queue(a,b) do { ; /*nothing*/ } while(0) +#endif + +/* + * We sort the outstanding diff entries according to the rank (see + * comment at the beginning of record_rename_pair) and tiebreak with + * the order in the original input. + */ +static int rank_compare(const void *a_, const void *b_) +{ + const struct diff_file_pair *a = *(const struct diff_file_pair **)a_; + const struct diff_file_pair *b = *(const struct diff_file_pair **)b_; + int a_rank = a->xfrm_work & ((1<<RENAME_SCORE_SHIFT) - 1); + int b_rank = b->xfrm_work & ((1<<RENAME_SCORE_SHIFT) - 1); + + if (a_rank != b_rank) + return a_rank - b_rank; + return a->orig_order - b->orig_order; +} + +/* + * We sort the rename similarity matrix with the score, in descending + * order (more similar first). + */ +static int score_compare(const void *a_, const void *b_) +{ + const struct diff_score *a = a_, *b = b_; + return b->score - a->score; +} + +static int needs_to_stay(struct diff_queue_struct *q, int i, + struct diff_filespec *it) +{ + /* If it will be used in later entry (either stay or used + * as the source of rename/copy), we need to copy, not rename. + */ + while (i < q->nr) { + struct diff_file_pair *p = q->queue[i++]; + if (!p->two->file_valid) + continue; /* removed is fine */ + if (strcmp(p->one->path, it->path)) + continue; /* not relevant */ + + /* p has its src set to *it and it is not a delete; + * it will be used for in-place change or rename/copy, + * so we cannot rename it out. + */ + return 1; + } + return 0; +} + +void diff_detect_rename(struct diff_queue_struct *q, + int detect_rename, + int minimum_score) +{ + struct diff_queue_struct outq; + struct diff_rename_pool created, deleted, stay; + struct diff_rename_pool *(srcs[2]); + struct diff_score *mx; + int h, i, j; + int num_create, num_src, dst_cnt, src_cnt; + + outq.queue = NULL; + outq.nr = outq.alloc = 0; + + diff_rename_pool_clear(&created); + diff_rename_pool_clear(&deleted); + diff_rename_pool_clear(&stay); + + srcs[0] = &deleted; + srcs[1] = &stay; + + /* NEEDSWORK: + * (1) make sure we properly ignore but pass trees. + * + * (2) make sure we do right thing on the same path deleted + * and created in the same patch. + */ + + for (i = 0; i < q->nr; i++) { + struct diff_file_pair *p = q->queue[i]; + if (!p->one->file_valid) + if (!p->two->file_valid) + continue; /* ignore nonsense */ + else + diff_rename_pool_add(&created, p->two); + else if (!p->two->file_valid) + diff_rename_pool_add(&deleted, p->one); + else if (1 < detect_rename) /* find copy, too */ + diff_rename_pool_add(&stay, p->one); + } + if (created.nr == 0) + goto cleanup; /* nothing to do */ + + /* We really want to cull the candidates list early + * with cheap tests in order to avoid doing deltas. + * + * With the current callers, we should not have already + * matched entries at this point, but it is nonetheless + * checked for sanity. + */ + for (i = 0; i < created.nr; i++) { + if (created.s[i]->xfrm_flags & RENAME_DST_MATCHED) + continue; /* we have matched exactly already */ + for (h = 0; h < sizeof(srcs)/sizeof(srcs[0]); h++) { + struct diff_rename_pool *p = srcs[h]; + for (j = 0; j < p->nr; j++) { + if (!is_exact_match(p->s[j], created.s[i])) + continue; + record_rename_pair(&outq, + p->s[j], created.s[i], h, + MAX_SCORE); + break; /* we are done with this entry */ + } + } + } + debug_queue("done detecting exact", &outq); + + /* Have we run out the created file pool? If so we can avoid + * doing the delta matrix altogether. + */ + if (outq.nr == created.nr) + goto flush_rest; + + num_create = (created.nr - outq.nr); + num_src = deleted.nr + stay.nr; + mx = xmalloc(sizeof(*mx) * num_create * num_src); + for (dst_cnt = i = 0; i < created.nr; i++) { + int base = dst_cnt * num_src; + if (created.s[i]->xfrm_flags & RENAME_DST_MATCHED) + continue; /* dealt with exact match already. */ + for (src_cnt = h = 0; h < sizeof(srcs)/sizeof(srcs[0]); h++) { + struct diff_rename_pool *p = srcs[h]; + for (j = 0; j < p->nr; j++, src_cnt++) { + struct diff_score *m = &mx[base + src_cnt]; + m->src = p->s[j]; + m->dst = created.s[i]; + m->score = estimate_similarity(m->src, m->dst, + minimum_score); + m->rank = h; + } + } + dst_cnt++; + } + /* cost matrix sorted by most to least similar pair */ + qsort(mx, num_create * num_src, sizeof(*mx), score_compare); + for (i = 0; i < num_create * num_src; i++) { + if (mx[i].dst->xfrm_flags & RENAME_DST_MATCHED) + continue; /* alreayd done, either exact or fuzzy. */ + if (mx[i].score < minimum_score) + continue; + record_rename_pair(&outq, + mx[i].src, mx[i].dst, mx[i].rank, + mx[i].score); + } + free(mx); + debug_queue("done detecting fuzzy", &outq); + + flush_rest: + /* At this point, we have found some renames and copies and they + * are kept in outq. The original list is still in *q. + * + * Scan the original list and move them into the outq; we will sort + * outq and swap it into the queue supplied to pass that to + * downstream, so we assign the sort keys in this loop. + * + * See comments at the top of record_rename_pair for numbers used + * to assign xfrm_work. + * + * Note that we have not annotated the diff_file_pair with any comment + * so there is nothing other than p to free. + */ + for (i = 0; i < q->nr; i++) { + struct diff_file_pair *dp, *p = q->queue[i]; + if (!p->one->file_valid) { + if (p->two->file_valid) { + /* creation */ + dp = diff_queue(&outq, p->one, p->two); + dp->xfrm_work = 4; + } + /* otherwise it is a nonsense; just ignore it */ + } + else if (!p->two->file_valid) { + /* deletion */ + dp = diff_queue(&outq, p->one, p->two); + dp->xfrm_work = 2; + } + else { + /* modification, or stay as is */ + dp = diff_queue(&outq, p->one, p->two); + dp->xfrm_work = 4; + } + free(p); + } + debug_queue("done copying original", &outq); + + /* Sort outq */ + qsort(outq.queue, outq.nr, sizeof(outq.queue[0]), rank_compare); + + debug_queue("done sorting", &outq); + + free(q->queue); + q->nr = q->alloc = 0; + q->queue = NULL; + + /* Copy it out to q, removing duplicates. */ + for (i = 0; i < outq.nr; i++) { + struct diff_file_pair *p = outq.queue[i]; + if (!p->one->file_valid) { + /* created */ + if (p->two->xfrm_flags & RENAME_DST_MATCHED) + ; /* rename/copy created it already */ + else + diff_queue(q, p->one, p->two); + } + else if (!p->two->file_valid) { + /* deleted */ + if (p->one->xfrm_flags & RENAME_SRC_GONE) + ; /* rename/copy deleted it already */ + else + diff_queue(q, p->one, p->two); + } + else if (strcmp(p->one->path, p->two->path)) { + /* rename or copy */ + struct diff_file_pair *dp = + diff_queue(q, p->one, p->two); + int msglen = (strlen(p->one->path) + + strlen(p->two->path) + 100); + int score = (p->xfrm_work >> RENAME_SCORE_SHIFT); + dp->xfrm_msg = xmalloc(msglen); + + /* if we have a later entry that is a rename/copy + * that depends on p->one, then we copy here. + * otherwise we rename it. + */ + if (needs_to_stay(&outq, i+1, p->one)) { + /* copy it */ + sprintf(dp->xfrm_msg, + "similarity index %d%%\n" + "copy from %s\n" + "copy to %s\n", + (int)(0.5 + score * 100 / MAX_SCORE), + p->one->path, p->two->path); + } + else { + /* rename it, and mark it as gone. */ + p->one->xfrm_flags |= RENAME_SRC_GONE; + sprintf(dp->xfrm_msg, + "similarity index %d%%\n" + "rename old %s\n" + "rename new %s\n", + (int)(0.5 + score * 100 / MAX_SCORE), + p->one->path, p->two->path); + } + } + else + /* otherwise it is a modified (or stayed) entry */ + diff_queue(q, p->one, p->two); + free(p); + } + + free(outq.queue); + debug_queue("done collapsing", q); + + cleanup: + free(created.s); + free(deleted.s); + free(stay.s); + return; +} diff --git a/diffcore.h b/diffcore.h new file mode 100644 --- /dev/null +++ b/diffcore.h @@ -0,0 +1,60 @@ +/* + * Copyright (C) 2005 Junio C Hamano + */ +#ifndef _DIFFCORE_H_ +#define _DIFFCORE_H_ + +/* This header file is internal between diff.c and its diff transformers + * (e.g. diffcore-rename, diffcore-pickaxe). Never include this header + * in anything else. + */ +#define MAX_SCORE 10000 +#define DEFAULT_MINIMUM_SCORE 5000 + +#define RENAME_DST_MATCHED 01 +#define RENAME_SRC_GONE 02 +#define RENAME_SCORE_SHIFT 8 + +struct diff_filespec { + unsigned char sha1[20]; + char *path; + void *data; + unsigned long size; + int xfrm_flags; /* for use by the xfrm */ + unsigned short mode; /* file mode */ + unsigned sha1_valid : 1; /* if true, use sha1 and trust mode; + * if false, use the name and read from + * the filesystem. + */ + unsigned file_valid : 1; /* if false the file does not exist */ + unsigned should_free : 1; /* data should be free()'ed */ + unsigned should_munmap : 1; /* data should be munmap()'ed */ +}; + +extern struct diff_filespec *alloc_filespec(const char *); +extern void fill_filespec(struct diff_filespec *, const unsigned char *, + unsigned short); + +extern int diff_populate_filespec(struct diff_filespec *); +extern void diff_free_filespec_data(struct diff_filespec *); + +struct diff_file_pair { + struct diff_filespec *one; + struct diff_filespec *two; + char *xfrm_msg; + int orig_order; /* the original order of insertion into the queue */ + int xfrm_work; /* for use by tramsformers, not by diffcore */ +}; + +struct diff_queue_struct { + struct diff_file_pair **queue; + int alloc; + int nr; +}; + +extern struct diff_file_pair *diff_queue(struct diff_queue_struct *, + struct diff_filespec *, + struct diff_filespec *); +extern void diff_detect_rename(struct diff_queue_struct *, int, int); + +#endif diff --git a/git-apply-patch-script b/git-apply-patch-script --- a/git-apply-patch-script +++ b/git-apply-patch-script @@ -11,7 +11,7 @@ case "$#" in 1) echo >&2 "cannot handle unmerged diff on path $1." exit 1 ;; -8) +8 | 9) echo >&2 "cannot handle rename diff between $1 and $8 yet." exit 1 ;; esac diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh --- a/t/t0000-basic.sh +++ b/t/t0000-basic.sh @@ -32,7 +32,7 @@ test_expect_success \ find .git/objects -type d -print >full-of-directories test_expect_success \ '.git/objects should have 256 subdirectories.' \ - 'test "$(wc -l full-of-directories | sed -e "s/ .*//")" = 257' + 'test $(cat full-of-directories | wc -l) = 257' ################################################################ # Basics of the basics diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -31,7 +31,7 @@ test_expect_success \ test_expect_success \ 'write that tree.' \ - 'tree=$(git-write-tree)' + 'tree=$(git-write-tree) && echo $tree' sed -e 's/line/Line/' <path0 >path1 rm -f path0 @@ -61,6 +61,6 @@ EOF test_expect_success \ 'validate the output.' \ - 'diff -I "rename similarity.*" >/dev/null current expected' + 'diff -I "similarity.*" >/dev/null current expected' test_done diff --git a/t/t4003-diff-rename-1.sh b/t/t4003-diff-rename-1.sh new file mode 100755 --- /dev/null +++ b/t/t4003-diff-rename-1.sh @@ -0,0 +1,93 @@ +#!/bin/sh +# +# Copyright (c) 2005 Junio C Hamano +# + +test_description='More rename detection + +' +. ./test-lib.sh + +test_expect_success \ + 'prepare reference tree' \ + 'cat ../../COPYING >COPYING && + git-update-cache --add COPYING && + tree=$(git-write-tree) && + echo $tree' + +test_expect_success \ + 'prepare work tree' \ + 'sed -e 's/HOWEVER/However/' <COPYING >COPYING.1 && + sed -e 's/GPL/G.P.L/g' <COPYING >COPYING.2 && + rm -f COPYING && + git-update-cache --add --remove COPYING COPYING.?' + +GIT_DIFF_OPTS=-u0 git-diff-cache -M $tree | +sed -e 's/\([0-9][0-9]*\)/#/g' >current && +cat >expected <<\EOF +diff --git a/COPYING b/COPYING.# +similarity index #% +copy from COPYING +copy to COPYING.# +--- a/COPYING ++++ b/COPYING.# +@@ -# +# @@ +- HOWEVER, in order to allow a migration to GPLv# if that seems like ++ However, in order to allow a migration to GPLv# if that seems like +diff --git a/COPYING b/COPYING.# +similarity index #% +rename old COPYING +rename new COPYING.# +--- a/COPYING ++++ b/COPYING.# +@@ -# +# @@ +- Note that the only valid version of the GPL as far as this project ++ Note that the only valid version of the G.P.L as far as this project +@@ -# +# @@ +- HOWEVER, in order to allow a migration to GPLv# if that seems like ++ HOWEVER, in order to allow a migration to G.P.Lv# if that seems like +@@ -# +# @@ +- This file is licensed under the GPL v#, or a later version ++ This file is licensed under the G.P.L v#, or a later version +EOF + +test_expect_success \ + 'validate output from rename/copy detection' \ + 'diff -u current expected' + +test_expect_success \ + 'prepare work tree again' \ + 'mv COPYING.2 COPYING && + git-update-cache --add --remove COPYING COPYING.1' + +GIT_DIFF_OPTS=-u0 git-diff-cache -C $tree | +sed -e 's/\([0-9][0-9]*\)/#/g' >current +cat >expected <<\EOF +diff --git a/COPYING b/COPYING.# +similarity index #% +copy from COPYING +copy to COPYING.# +--- a/COPYING ++++ b/COPYING.# +@@ -# +# @@ +- HOWEVER, in order to allow a migration to GPLv# if that seems like ++ However, in order to allow a migration to GPLv# if that seems like +diff --git a/COPYING b/COPYING +--- a/COPYING ++++ b/COPYING +@@ -# +# @@ +- Note that the only valid version of the GPL as far as this project ++ Note that the only valid version of the G.P.L as far as this project +@@ -# +# @@ +- HOWEVER, in order to allow a migration to GPLv# if that seems like ++ HOWEVER, in order to allow a migration to G.P.Lv# if that seems like +@@ -# +# @@ +- This file is licensed under the GPL v#, or a later version ++ This file is licensed under the G.P.L v#, or a later version +EOF + +test_expect_success \ + 'validate output from rename/copy detection' \ + 'diff -u current expected' + +test_done ------------------------------------------------ ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 2/3] Introducing software archaeologist's tool "pickaxe". 2005-05-21 9:37 ` Junio C Hamano 2005-05-21 9:39 ` [PATCH 1/3] Diff overhaul, adding half of copy detection Junio C Hamano @ 2005-05-21 9:40 ` Junio C Hamano 2005-05-21 22:02 ` [PATCH] Constness fix for pickaxe option Junio C Hamano 2005-05-21 9:42 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Junio C Hamano 2 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 9:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: git This steals the "pickaxe" feature from JIT and make it available to the bare Plumbing layer. From the command line, the user gives a string he is intersted in. Using the diff-core infrastructure previously introduced, it filters the differences to limit the output only to the diffs between <src> and <dst> where the string appears only in one but not in the other. For example: $ ./git-rev-list HEAD | ./git-diff-tree -Sdiff-tree-helper --stdin -M would show the diffs that touch the string "diff-tree-helper". In real software-archaeologist application, you would typically look for a few to several lines of code and see where that code came from. The "pickaxe" module runs after "rename/copy detection" module, so it even crosses the file rename boundary, as the above example demonstrates. Signed-off-by: Junio C Hamano <junkio@cox.net> --- Documentation/git-diff-cache.txt | 6 +++- Documentation/git-diff-files.txt | 6 +++- Documentation/git-diff-helper.txt | 6 +++- Documentation/git-diff-tree.txt | 5 ++- Makefile | 3 +- diff-cache.c | 11 +++++-- diff-files.c | 9 ++++-- diff-helper.c | 10 ++++-- diff-tree.c | 15 ++++++---- diff.c | 23 +++++++++------ diff.h | 1 diffcore-pickaxe.c | 56 ++++++++++++++++++++++++++++++++++++++ diffcore-rename.c | 29 +++++++------------ diffcore.h | 11 ++++--- 14 files changed, 140 insertions(+), 51 deletions(-) new file (100644): diffcore-pickaxe.c diff --git a/Documentation/git-diff-cache.txt b/Documentation/git-diff-cache.txt --- a/Documentation/git-diff-cache.txt +++ b/Documentation/git-diff-cache.txt @@ -9,7 +9,7 @@ git-diff-cache - Compares content and mo SYNOPSIS -------- -'git-diff-cache' [-p] [-r] [-z] [-m] [-M] [-R] [-C] [--cached] <tree-ish> +'git-diff-cache' [-p] [-r] [-z] [-m] [-M] [-R] [-C] [-S<string>] [--cached] <tree-ish> DESCRIPTION ----------- @@ -39,6 +39,10 @@ OPTIONS -C:: Detect copies as well as renames; implies -p. +-S<string>:: + Look for differences that contains the change in <string>. + + -R:: Output diff in reverse. diff --git a/Documentation/git-diff-files.txt b/Documentation/git-diff-files.txt --- a/Documentation/git-diff-files.txt +++ b/Documentation/git-diff-files.txt @@ -9,7 +9,7 @@ git-diff-files - Compares files in the w SYNOPSIS -------- -'git-diff-files' [-p] [-q] [-r] [-z] [-M] [-C] [-R] [<pattern>...] +'git-diff-files' [-p] [-q] [-r] [-z] [-M] [-C] [-R] [-S<string>] [<pattern>...] DESCRIPTION ----------- @@ -35,6 +35,10 @@ OPTIONS -C:: Detect copies as well as renames; implies -p. +-S<string>:: + Look for differences that contains the change in <string>. + + -r:: This flag does not mean anything. It is there only to match git-diff-tree. Unlike git-diff-tree, git-diff-files always looks diff --git a/Documentation/git-diff-helper.txt b/Documentation/git-diff-helper.txt --- a/Documentation/git-diff-helper.txt +++ b/Documentation/git-diff-helper.txt @@ -9,7 +9,7 @@ git-diff-helper - Generates patch format SYNOPSIS -------- -'git-diff-helper' [-z] [-R] [-M] [-C] +'git-diff-helper' [-z] [-R] [-M] [-C] [-S<string>] DESCRIPTION ----------- @@ -37,6 +37,10 @@ OPTIONS -C:: Detect copies as well as renames. +-S<string>:: + Look for differences that contains the change in <string>. + + See Also -------- The section on generating patches in link:git-diff-cache.html[git-diff-cache] diff --git a/Documentation/git-diff-tree.txt b/Documentation/git-diff-tree.txt --- a/Documentation/git-diff-tree.txt +++ b/Documentation/git-diff-tree.txt @@ -9,7 +9,7 @@ git-diff-tree - Compares the content and SYNOPSIS -------- -'git-diff-tree' [-p] [-r] [-z] [--stdin] [-M] [-R] [-C] [-m] [-s] [-v] <tree-ish> <tree-ish> [<pattern>]\* +'git-diff-tree' [-p] [-r] [-z] [--stdin] [-M] [-R] [-C] [-S<string>] [-m] [-s] [-v] <tree-ish> <tree-ish> [<pattern>]\* DESCRIPTION ----------- @@ -43,6 +43,9 @@ OPTIONS -R:: Output diff in reverse. +-S<string>:: + Look for differences that contains the change in <string>. + -r:: recurse diff --git a/Makefile b/Makefile --- a/Makefile +++ b/Makefile @@ -45,7 +45,7 @@ LIB_H += strbuf.h LIB_OBJS += strbuf.o LIB_H += diff.h -LIB_OBJS += diff.o diffcore-rename.o +LIB_OBJS += diff.o diffcore-rename.o diffcore-pickaxe.o LIB_OBJS += gitenv.o @@ -125,6 +125,7 @@ strbuf.o: $(LIB_H) gitenv.o: $(LIB_H) diff.o: $(LIB_H) diffcore-rename.o : $(LIB_H) +diffcore-pickaxe.o : $(LIB_H) test: all make -C t/ all diff --git a/diff-cache.c b/diff-cache.c --- a/diff-cache.c +++ b/diff-cache.c @@ -8,6 +8,7 @@ static int line_termination = '\n'; static int detect_rename = 0; static int reverse_diff = 0; static int diff_score_opt = 0; +static char *pickaxe = 0; /* A file entry went away or appeared */ static void show_file(const char *prefix, struct cache_entry *ce, unsigned char *sha1, unsigned int mode) @@ -153,7 +154,7 @@ static void mark_merge_entries(void) } static char *diff_cache_usage = -"git-diff-cache [-p] [-r] [-z] [-m] [-M] [-C] [-R] [--cached] <tree-ish>"; +"git-diff-cache [-p] [-r] [-z] [-m] [-M] [-C] [-R] [-S<string>] [--cached] <tree-ish>"; int main(int argc, char **argv) { @@ -194,6 +195,10 @@ int main(int argc, char **argv) reverse_diff = 1; continue; } + if (!strcmp(arg, "-S")) { + pickaxe = arg + 2; + continue; + } if (!strcmp(arg, "-m")) { match_nonexisting = 1; continue; @@ -208,8 +213,8 @@ int main(int argc, char **argv) if (argc != 2 || get_sha1(argv[1], tree_sha1)) usage(diff_cache_usage); - diff_setup(detect_rename, diff_score_opt, reverse_diff, - (generate_patch ? -1 : line_termination), + diff_setup(detect_rename, diff_score_opt, pickaxe, + reverse_diff, (generate_patch ? -1 : line_termination), NULL, 0); mark_merge_entries(); diff --git a/diff-files.c b/diff-files.c --- a/diff-files.c +++ b/diff-files.c @@ -7,13 +7,14 @@ #include "diff.h" static const char *diff_files_usage = -"git-diff-files [-p] [-q] [-r] [-z] [-M] [-C] [-R] [paths...]"; +"git-diff-files [-p] [-q] [-r] [-z] [-M] [-C] [-R] [-S<string>] [paths...]"; static int generate_patch = 0; static int line_termination = '\n'; static int detect_rename = 0; static int reverse_diff = 0; static int diff_score_opt = 0; +static char *pickaxe = 0; static int silent = 0; static int matches_pathspec(struct cache_entry *ce, char **spec, int cnt) @@ -67,6 +68,8 @@ int main(int argc, char **argv) line_termination = 0; else if (!strcmp(argv[1], "-R")) reverse_diff = 1; + else if (!strcmp(argv[1], "-S")) + pickaxe = argv[1] + 2; else if (!strncmp(argv[1], "-M", 2)) { diff_score_opt = diff_scoreopt_parse(argv[1]); detect_rename = generate_patch = 1; @@ -89,8 +92,8 @@ int main(int argc, char **argv) exit(1); } - diff_setup(detect_rename, diff_score_opt, reverse_diff, - (generate_patch ? -1 : line_termination), + diff_setup(detect_rename, diff_score_opt, pickaxe, + reverse_diff, (generate_patch ? -1 : line_termination), NULL, 0); for (i = 0; i < entries; i++) { diff --git a/diff-helper.c b/diff-helper.c --- a/diff-helper.c +++ b/diff-helper.c @@ -9,6 +9,7 @@ static int detect_rename = 0; static int diff_score_opt = 0; static int generate_patch = 1; +static char *pickaxe = 0; static int parse_oneside_change(const char *cp, int *mode, unsigned char *sha1, char *path) @@ -93,7 +94,7 @@ static int parse_diff_raw_output(const c } static const char *diff_helper_usage = - "git-diff-helper [-z] [-R] [-M] [-C] paths..."; + "git-diff-helper [-z] [-R] [-M] [-C] [-S<string>] paths..."; int main(int ac, const char **av) { struct strbuf sb; @@ -117,14 +118,17 @@ int main(int ac, const char **av) { detect_rename = 2; diff_score_opt = diff_scoreopt_parse(av[1]); } + else if (av[1][1] == 'S') { + pickaxe = av[1] + 2; + } else usage(diff_helper_usage); ac--; av++; } /* the remaining parameters are paths patterns */ - diff_setup(detect_rename, diff_score_opt, reverse, - (generate_patch ? -1 : line_termination), + diff_setup(detect_rename, diff_score_opt, pickaxe, + reverse, (generate_patch ? -1 : line_termination), av+1, ac-1); while (1) { diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -13,6 +13,7 @@ static int generate_patch = 0; static int detect_rename = 0; static int reverse_diff = 0; static int diff_score_opt = 0; +static char *pickaxe = 0; static const char *header = NULL; static const char *header_prefix = ""; @@ -271,8 +272,8 @@ static int diff_tree_sha1_top(const unsi { int ret; - diff_setup(detect_rename, diff_score_opt, reverse_diff, - (generate_patch ? -1 : line_termination), + diff_setup(detect_rename, diff_score_opt, pickaxe, + reverse_diff, (generate_patch ? -1 : line_termination), NULL, 0); ret = diff_tree_sha1(old, new, base); diff_flush(); @@ -285,8 +286,8 @@ static int diff_root_tree(const unsigned void *tree; unsigned long size; - diff_setup(detect_rename, diff_score_opt, reverse_diff, - (generate_patch ? -1 : line_termination), + diff_setup(detect_rename, diff_score_opt, pickaxe, + reverse_diff, (generate_patch ? -1 : line_termination), NULL, 0); tree = read_object_with_reference(new, "tree", &size, NULL); if (!tree) @@ -430,7 +431,7 @@ static int diff_tree_stdin(char *line) } static char *diff_tree_usage = -"git-diff-tree [-p] [-r] [-z] [--stdin] [-M] [-C] [-R] [-m] [-s] [-v] <tree-ish> <tree-ish>"; +"git-diff-tree [-p] [-r] [-z] [--stdin] [-M] [-C] [-R] [-S<string>] [-m] [-s] [-v] <tree-ish> <tree-ish>"; int main(int argc, char **argv) { @@ -473,6 +474,10 @@ int main(int argc, char **argv) recursive = generate_patch = 1; continue; } + if (!strncmp(arg, "-S", 2)) { + pickaxe = arg + 2; + continue; + } if (!strncmp(arg, "-M", 2)) { detect_rename = recursive = generate_patch = 1; diff_score_opt = diff_scoreopt_parse(arg); diff --git a/diff.c b/diff.c --- a/diff.c +++ b/diff.c @@ -17,6 +17,7 @@ static int reverse_diff; static int diff_raw_output = -1; static const char **pathspec; static int speccnt; +static const char *pickaxe; static int minimum_score; static const char *external_diff(void) @@ -511,8 +512,9 @@ int diff_scoreopt_parse(const char *opt) return MAX_SCORE * num / scale; } -void diff_setup(int detect_rename_, int minimum_score_, int reverse_diff_, - int diff_raw_output_, +void diff_setup(int detect_rename_, int minimum_score_, + char *pickaxe_, + int reverse_diff_, int diff_raw_output_, const char **pathspec_, int speccnt_) { detect_rename = detect_rename_; @@ -521,15 +523,16 @@ void diff_setup(int detect_rename_, int diff_raw_output = diff_raw_output_; speccnt = speccnt_; minimum_score = minimum_score_ ? : DEFAULT_MINIMUM_SCORE; + pickaxe = pickaxe_; } static struct diff_queue_struct queued_diff; -struct diff_file_pair *diff_queue(struct diff_queue_struct *queue, +struct diff_filepair *diff_queue(struct diff_queue_struct *queue, struct diff_filespec *one, struct diff_filespec *two) { - struct diff_file_pair *dp = xmalloc(sizeof(*dp)); + struct diff_filepair *dp = xmalloc(sizeof(*dp)); dp->one = one; dp->two = two; dp->xfrm_msg = 0; @@ -549,7 +552,7 @@ static const char *git_object_type(unsig return S_ISDIR(mode) ? "tree" : "blob"; } -static void diff_flush_raw(struct diff_file_pair *p) +static void diff_flush_raw(struct diff_filepair *p) { struct diff_filespec *it; int addremove; @@ -583,7 +586,7 @@ static void diff_flush_raw(struct diff_f sha1_to_hex(it->sha1), it->path, diff_raw_output); } -static void diff_flush_patch(struct diff_file_pair *p) +static void diff_flush_patch(struct diff_filepair *p) { const char *name, *other; @@ -600,7 +603,7 @@ static int identical(struct diff_filespe { /* This function is written stricter than necessary to support * the currently implemented transformers, but the idea is to - * let transformers to produce diff_file_pairs any way they want, + * let transformers to produce diff_filepairs any way they want, * and filter and clean them up here before producing the output. */ @@ -623,7 +626,7 @@ static int identical(struct diff_filespe return 0; } -static void diff_flush_one(struct diff_file_pair *p) +static void diff_flush_one(struct diff_filepair *p) { if (identical(p->one, p->two)) return; @@ -640,11 +643,13 @@ void diff_flush(void) if (detect_rename) diff_detect_rename(q, detect_rename, minimum_score); + if (pickaxe) + diff_pickaxe(q, pickaxe); for (i = 0; i < q->nr; i++) diff_flush_one(q->queue[i]); for (i = 0; i < q->nr; i++) { - struct diff_file_pair *p = q->queue[i]; + struct diff_filepair *p = q->queue[i]; diff_free_filespec_data(p->one); diff_free_filespec_data(p->two); free(p->xfrm_msg); diff --git a/diff.h b/diff.h --- a/diff.h +++ b/diff.h @@ -20,6 +20,7 @@ extern void diff_unmerge(const char *pat extern int diff_scoreopt_parse(const char *opt); extern void diff_setup(int detect_rename, int minimum_score, + char *pickaxe, int reverse, int raw_output, const char **spec, int cnt); diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c new file mode 100644 --- /dev/null +++ b/diffcore-pickaxe.c @@ -0,0 +1,56 @@ +/* + * Copyright (C) 2005 Junio C Hamano + */ +#include "cache.h" +#include "diff.h" +#include "diffcore.h" +#include "delta.h" + +static int contains(struct diff_filespec *one, + const char *needle, unsigned long len) +{ + unsigned long offset, sz; + const char *data; + if (diff_populate_filespec(one)) + return 0; + sz = one->size; + data = one->data; + for (offset = 0; offset + len <= sz; offset++) + if (!strncmp(needle, data + offset, len)) + return 1; + return 0; +} + +void diff_pickaxe(struct diff_queue_struct *q, const char *needle) +{ + unsigned long len = strlen(needle); + int i; + struct diff_queue_struct outq; + outq.queue = NULL; + outq.nr = outq.alloc = 0; + + for (i = 0; i < q->nr; i++) { + struct diff_filepair *p = q->queue[i]; + if (!p->one->file_valid) { + if (!p->two->file_valid) + continue; /* ignore nonsense */ + /* created */ + if (contains(p->two, needle, len)) + diff_queue(&outq, p->one, p->two); + } + else if (!p->two->file_valid) { + if (contains(p->one, needle, len)) + diff_queue(&outq, p->one, p->two); + } + else if (contains(p->one, needle, len) != + contains(p->two, needle, len)) + diff_queue(&outq, p->one, p->two); + } + for (i = 0; i < q->nr; i++) { + struct diff_filepair *p = q->queue[i]; + free(p); + } + free(q->queue); + *q = outq; + return; +} diff --git a/diffcore-rename.c b/diffcore-rename.c --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -129,7 +129,7 @@ static void record_rename_pair(struct di * To achieve this sort order, we give xform_work the number * above. */ - struct diff_file_pair *dp = diff_queue(outq, src, dst); + struct diff_filepair *dp = diff_queue(outq, src, dst); dp->xfrm_work = (rank * 2 + 1) | (score<<RENAME_SCORE_SHIFT); dst->xfrm_flags |= RENAME_DST_MATCHED; } @@ -148,7 +148,7 @@ static void debug_filespec(struct diff_f s->size, s->xfrm_flags); } -static void debug_filepair(const struct diff_file_pair *p, int i) +static void debug_filepair(const struct diff_filepair *p, int i) { debug_filespec(p->one, i, "one"); debug_filespec(p->two, i, "two"); @@ -165,7 +165,7 @@ static void debug_queue(const char *msg, fprintf(stderr, "%s\n", msg); fprintf(stderr, "q->nr = %d\n", q->nr); for (i = 0; i < q->nr; i++) { - struct diff_file_pair *p = q->queue[i]; + struct diff_filepair *p = q->queue[i]; debug_filepair(p, i); } } @@ -180,8 +180,8 @@ static void debug_queue(const char *msg, */ static int rank_compare(const void *a_, const void *b_) { - const struct diff_file_pair *a = *(const struct diff_file_pair **)a_; - const struct diff_file_pair *b = *(const struct diff_file_pair **)b_; + const struct diff_filepair *a = *(const struct diff_filepair **)a_; + const struct diff_filepair *b = *(const struct diff_filepair **)b_; int a_rank = a->xfrm_work & ((1<<RENAME_SCORE_SHIFT) - 1); int b_rank = b->xfrm_work & ((1<<RENAME_SCORE_SHIFT) - 1); @@ -207,7 +207,7 @@ static int needs_to_stay(struct diff_que * as the source of rename/copy), we need to copy, not rename. */ while (i < q->nr) { - struct diff_file_pair *p = q->queue[i++]; + struct diff_filepair *p = q->queue[i++]; if (!p->two->file_valid) continue; /* removed is fine */ if (strcmp(p->one->path, it->path)) @@ -243,15 +243,8 @@ void diff_detect_rename(struct diff_queu srcs[0] = &deleted; srcs[1] = &stay; - /* NEEDSWORK: - * (1) make sure we properly ignore but pass trees. - * - * (2) make sure we do right thing on the same path deleted - * and created in the same patch. - */ - for (i = 0; i < q->nr; i++) { - struct diff_file_pair *p = q->queue[i]; + struct diff_filepair *p = q->queue[i]; if (!p->one->file_valid) if (!p->two->file_valid) continue; /* ignore nonsense */ @@ -340,11 +333,11 @@ void diff_detect_rename(struct diff_queu * See comments at the top of record_rename_pair for numbers used * to assign xfrm_work. * - * Note that we have not annotated the diff_file_pair with any comment + * Note that we have not annotated the diff_filepair with any comment * so there is nothing other than p to free. */ for (i = 0; i < q->nr; i++) { - struct diff_file_pair *dp, *p = q->queue[i]; + struct diff_filepair *dp, *p = q->queue[i]; if (!p->one->file_valid) { if (p->two->file_valid) { /* creation */ @@ -378,7 +371,7 @@ void diff_detect_rename(struct diff_queu /* Copy it out to q, removing duplicates. */ for (i = 0; i < outq.nr; i++) { - struct diff_file_pair *p = outq.queue[i]; + struct diff_filepair *p = outq.queue[i]; if (!p->one->file_valid) { /* created */ if (p->two->xfrm_flags & RENAME_DST_MATCHED) @@ -395,7 +388,7 @@ void diff_detect_rename(struct diff_queu } else if (strcmp(p->one->path, p->two->path)) { /* rename or copy */ - struct diff_file_pair *dp = + struct diff_filepair *dp = diff_queue(q, p->one, p->two); int msglen = (strlen(p->one->path) + strlen(p->two->path) + 100); diff --git a/diffcore.h b/diffcore.h --- a/diffcore.h +++ b/diffcore.h @@ -38,7 +38,7 @@ extern void fill_filespec(struct diff_fi extern int diff_populate_filespec(struct diff_filespec *); extern void diff_free_filespec_data(struct diff_filespec *); -struct diff_file_pair { +struct diff_filepair { struct diff_filespec *one; struct diff_filespec *two; char *xfrm_msg; @@ -47,14 +47,15 @@ struct diff_file_pair { }; struct diff_queue_struct { - struct diff_file_pair **queue; + struct diff_filepair **queue; int alloc; int nr; }; -extern struct diff_file_pair *diff_queue(struct diff_queue_struct *, - struct diff_filespec *, - struct diff_filespec *); +extern struct diff_filepair *diff_queue(struct diff_queue_struct *, + struct diff_filespec *, + struct diff_filespec *); extern void diff_detect_rename(struct diff_queue_struct *, int, int); +extern void diff_pickaxe(struct diff_queue_struct *, const char *); #endif ------------------------------------------------ ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH] Constness fix for pickaxe option. 2005-05-21 9:40 ` [PATCH 2/3] Introducing software archaeologist's tool "pickaxe" Junio C Hamano @ 2005-05-21 22:02 ` Junio C Hamano 2005-05-21 22:16 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 22:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Constness fix for pickaxe option. Signed-off-by: Junio C Hamano <junkio@cox.net> --- diff-cache.c | 2 +- diff-files.c | 2 +- diff-helper.c | 2 +- diff-tree.c | 2 +- diff.c | 2 +- diff.h | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/diff-cache.c b/diff-cache.c --- a/diff-cache.c +++ b/diff-cache.c @@ -8,7 +8,7 @@ static int line_termination = '\n'; static int detect_rename = 0; static int reverse_diff = 0; static int diff_score_opt = 0; -static char *pickaxe = 0; +static const char *pickaxe = 0; /* A file entry went away or appeared */ static void show_file(const char *prefix, struct cache_entry *ce, unsigned char *sha1, unsigned int mode) diff --git a/diff-files.c b/diff-files.c --- a/diff-files.c +++ b/diff-files.c @@ -14,7 +14,7 @@ static int line_termination = '\n'; static int detect_rename = 0; static int reverse_diff = 0; static int diff_score_opt = 0; -static char *pickaxe = 0; +static const char *pickaxe = 0; static int silent = 0; static int matches_pathspec(struct cache_entry *ce, char **spec, int cnt) diff --git a/diff-helper.c b/diff-helper.c --- a/diff-helper.c +++ b/diff-helper.c @@ -9,7 +9,7 @@ static int detect_rename = 0; static int diff_score_opt = 0; static int generate_patch = 1; -static char *pickaxe = 0; +static const char *pickaxe = 0; static int parse_oneside_change(const char *cp, int *mode, unsigned char *sha1, char *path) diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -13,7 +13,7 @@ static int generate_patch = 0; static int detect_rename = 0; static int reverse_diff = 0; static int diff_score_opt = 0; -static char *pickaxe = 0; +static const char *pickaxe = 0; static const char *header = NULL; static const char *header_prefix = ""; diff --git a/diff.c b/diff.c --- a/diff.c +++ b/diff.c @@ -513,7 +513,7 @@ int diff_scoreopt_parse(const char *opt) } void diff_setup(int detect_rename_, int minimum_score_, - char *pickaxe_, + const char *pickaxe_, int reverse_diff_, int diff_raw_output_, const char **pathspec_, int speccnt_) { diff --git a/diff.h b/diff.h --- a/diff.h +++ b/diff.h @@ -20,7 +20,7 @@ extern void diff_unmerge(const char *pat extern int diff_scoreopt_parse(const char *opt); extern void diff_setup(int detect_rename, int minimum_score, - char *pickaxe, + const char *pickaxe, int reverse, int raw_output, const char **spec, int cnt); ------------------------------------------------ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Constness fix for pickaxe option. 2005-05-21 22:02 ` [PATCH] Constness fix for pickaxe option Junio C Hamano @ 2005-05-21 22:16 ` Linus Torvalds 0 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-21 22:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 21 May 2005, Junio C Hamano wrote: > > Constness fix for pickaxe option. Btw, this: static const char *pickaxe = 0; may be legal C, but that doesn't make it less crap. It's a pointer. If you want to make it a NULL pointer, make it a NULL pointer: static const char *pickaxe = NULL; and don't try to stuff an integer into a pointer. Yeah, yeah, the integer zero is magic and special, but that's a bug in the C language, and we should try to not perpetuate it. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 3/3] Diff overhaul, adding the other half of copy detection. 2005-05-21 9:37 ` Junio C Hamano 2005-05-21 9:39 ` [PATCH 1/3] Diff overhaul, adding half of copy detection Junio C Hamano 2005-05-21 9:40 ` [PATCH 2/3] Introducing software archaeologist's tool "pickaxe" Junio C Hamano @ 2005-05-21 9:42 ` Junio C Hamano 2005-05-21 10:11 ` [PATCH] Teach diff-tree to report unmodified paths for -C option Junio C Hamano ` (3 more replies) 2 siblings, 4 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 9:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: git This patch extends diff-cache and diff-files to report the unmodified files to diff-core as well when -C (copy detection) is in effect, so that the unmodified files can also be used as the source candidates. The existing test t4003 has been extended to cover this case. Signed-off-by: Junio C Hamano <junkio@cox.net> --- diff-cache.c | 3 ++- diff-files.c | 2 +- t/t4003-diff-rename-1.sh | 38 +++++++++++++++++++++++++++++++++++++- 3 files changed, 40 insertions(+), 3 deletions(-) diff --git a/diff-cache.c b/diff-cache.c --- a/diff-cache.c +++ b/diff-cache.c @@ -71,7 +71,8 @@ static int show_modified(struct cache_en } oldmode = old->ce_mode; - if (mode == oldmode && !memcmp(sha1, old->sha1, 20)) + if (mode == oldmode && !memcmp(sha1, old->sha1, 20) && + detect_rename < 2) return 0; mode = ntohl(mode); diff --git a/diff-files.c b/diff-files.c --- a/diff-files.c +++ b/diff-files.c @@ -126,7 +126,7 @@ int main(int argc, char **argv) continue; } changed = ce_match_stat(ce, &st); - if (!changed) + if (!changed && detect_rename < 2) continue; oldmode = ntohl(ce->ce_mode); diff --git a/t/t4003-diff-rename-1.sh b/t/t4003-diff-rename-1.sh --- a/t/t4003-diff-rename-1.sh +++ b/t/t4003-diff-rename-1.sh @@ -22,6 +22,10 @@ test_expect_success \ rm -f COPYING && git-update-cache --add --remove COPYING COPYING.?' +# tree has COPYING. work tree has COPYING.1 and COPYING.2, +# both are slightly edited. So we say you copy-and-edit one, +# and rename-and-edit the other. + GIT_DIFF_OPTS=-u0 git-diff-cache -M $tree | sed -e 's/\([0-9][0-9]*\)/#/g' >current && cat >expected <<\EOF @@ -58,7 +62,11 @@ test_expect_success \ test_expect_success \ 'prepare work tree again' \ 'mv COPYING.2 COPYING && - git-update-cache --add --remove COPYING COPYING.1' + git-update-cache --add --remove COPYING COPYING.1 COPYING.2' + +# tree has COPYING. work tree has COPYING and COPYING.1, +# both are slightly edited. So we say you edited one, +# and copy-and-edit the other. GIT_DIFF_OPTS=-u0 git-diff-cache -C $tree | sed -e 's/\([0-9][0-9]*\)/#/g' >current @@ -90,4 +98,32 @@ test_expect_success \ 'validate output from rename/copy detection' \ 'diff -u current expected' +test_expect_success \ + 'prepare work tree once again' \ + 'cat ../../COPYING >COPYING && + git-update-cache --add --remove COPYING COPYING.1' + +# tree has COPYING. work tree has the same COPYING and COPYING.1, +# but COPYING is not edited. We say you copy-and-edit COPYING.1; +# this is only possible because -C mode now reports the unmodified +# file to the diff-core. + +GIT_DIFF_OPTS=-u0 git-diff-cache -C $tree | +sed -e 's/\([0-9][0-9]*\)/#/g' >current +cat >expected <<\EOF +diff --git a/COPYING b/COPYING.# +similarity index #% +copy from COPYING +copy to COPYING.# +--- a/COPYING ++++ b/COPYING.# +@@ -# +# @@ +- HOWEVER, in order to allow a migration to GPLv# if that seems like ++ However, in order to allow a migration to GPLv# if that seems like +EOF + +test_expect_success \ + 'validate output from rename/copy detection' \ + 'diff -u current expected' + test_done ------------------------------------------------ ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH] Teach diff-tree to report unmodified paths for -C option. 2005-05-21 9:42 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Junio C Hamano @ 2005-05-21 10:11 ` Junio C Hamano 2005-05-21 10:51 ` Junio C Hamano 2005-05-21 17:25 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Linus Torvalds ` (2 subsequent siblings) 3 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 10:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus, I have a bit of design issue that there is no way for the diff-core layer to affect the behaviour of the caller. It is not too big a problem with others, but for diff-tree the effect is prominent. If you do this: ./git-rev-list HEAD | ./git-diff-tree -Sdiff-tree-helper --stdin -M you will get many uninteresting headers from git-diff-tree, until you hit something "interesting". What I want to see in this case is probably to omit the header and commit information diff-tree usually gives while "pickaxe" filters everything out (i.e. nothing interesting to report), and teach diff-tree to do the header thing only when diff-core says there is something interesting. So far I haven't found a good way to do this. Another useless comment. For obvious reasons, there is nothing we can do about the diff-helper to add "the other half of copy detection information", because what it can tell diff-core is limited to its input, which usually is just differences prepared by somebody else, and it cannot know anything about unchanged files. When I started pushing '-p' flag to diff-tree family, I remember that your reaction was neutral to moderately negative ("I'd tolerate, although I think it is redundant and you are not even generating diff yourself anyway" as opposed to "That's just great"). I think now you would thank me for shoving the diff interface into them ;-). I did not touch diff-tree for full-scale -C option in the last series, but if you want to have it, it is easy. This one is untested but it should just work (TM). Signed-off-by: Junio C Hamano <junkio@cox.net> --- # - HEAD: Diff overhaul, adding the other half of copy detection. # + (working tree) diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -117,7 +117,7 @@ static int compare_tree_entry(void *tree show_file("+", tree2, size2, base); return 1; } - if (!memcmp(sha1, sha2, 20) && mode1 == mode2) + if (!memcmp(sha1, sha2, 20) && mode1 == mode2 && detect_rename < 2) return 0; /* ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] Teach diff-tree to report unmodified paths for -C option. 2005-05-21 10:11 ` [PATCH] Teach diff-tree to report unmodified paths for -C option Junio C Hamano @ 2005-05-21 10:51 ` Junio C Hamano 0 siblings, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 10:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "JCH" == Junio C Hamano <junkio@cox.net> writes: JCH> Linus, I have a bit of design issue that... Another thing I forgot to mention, even in the documentation. Currently all the diff-tree family says -M and -C imply -p which is by itself sensible, because rename-edit and copy-edit cannot be expressed with the diff-raw format (the format only takes one path). On the other hand, pickaxe is really a diff filter, so it can work with both diff-raw and diff-patch format. Here is my favorite example, in git.pasky aka cogito archive: ./git-rev-list pasky | ./git-diff-tree --stdin -S'static const char *tag_cached = ""; static const char *tag_unmerged = ""; static const char *tag_removed = "";' ..... ffbe1addd5a5b7b7c2f987625a5aa6c1d22e3705 (from 20d37ef67286e5131d2333d7b4662bc70f9d4937) 20d37ef67286e5131d2333d7b4662bc70f9d4937 (from e78d97723cd77d46d8767a5a27965077249fd080) *100644->100644 blob 58b5aee94ee76c9201c580d271fcb6aaf20d421f->a2d8cbb2f8d0090a496177cd38f252786d29c3d6 ls-files.c e78d97723cd77d46d8767a5a27965077249fd080 (from cc167ccaeb1adcdc392f9e03ed1225762ea3cf96) cc167ccaeb1adcdc392f9e03ed1225762ea3cf96 (from 4df1e7950787197f7cad081b98f9ffe6c4089fc9) ..... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half of copy detection. 2005-05-21 9:42 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Junio C Hamano 2005-05-21 10:11 ` [PATCH] Teach diff-tree to report unmodified paths for -C option Junio C Hamano @ 2005-05-21 17:25 ` Linus Torvalds 2005-05-21 18:10 ` [PATCH 3/3] Diff overhaul, adding the other half Junio C Hamano 2005-05-21 22:55 ` [PATCH] Tweak diffcore-rename heuristics Junio C Hamano [not found] ` <Pine.LNX.4.58.0505211004400.2206@ppc970.osdl.org> 3 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2005-05-21 17:25 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 21 May 2005, Junio C Hamano wrote: > > This patch extends diff-cache and diff-files to report the > unmodified files to diff-core as well when -C (copy detection) > is in effect, so that the unmodified files can also be used as > the source candidates. The existing test t4003 has been > extended to cover this case. I love how I can just say "oh, keep in mind that we might want to.." and 24 hours later you did it. Applied and pushed out, and I enjoyed seeing the output of git-whatchanged -C and doing a "/^copy" that shows it figuring out (correctly) that git-resolve-script was based on git-pull-script. Very cool. I'm also somewhat surprised by the fact that it even seems to be usable on the kernel tree: diff --git a/include/asm-um/archparam-i386.h b/include/asm-um/elf-i386.h similarity index 89% copy from include/asm-um/archparam-i386.h copy to include/asm-um/elf-i386.h --- a/include/asm-um/archparam-i386.h +++ b/include/asm-um/elf-i386.h ... cool. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 17:25 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Linus Torvalds @ 2005-05-21 18:10 ` Junio C Hamano 2005-05-21 18:27 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 18:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> I love how I can just say "oh, keep in mind that we might LT> want to.." and 24 hours later you did it. Nah, it's more like 40 hours, but yes I am here to serve your wishes ;-). LT> I'm also somewhat surprised by the fact that it even seems LT> to be usable on the kernel tree: Surprised? Correctness-wise and/or performance-wise? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 18:10 ` [PATCH 3/3] Diff overhaul, adding the other half Junio C Hamano @ 2005-05-21 18:27 ` Linus Torvalds 2005-05-21 18:34 ` Linus Torvalds 2005-05-21 19:54 ` [PATCH 3/3] Diff overhaul, adding the other half Junio C Hamano 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-21 18:27 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 21 May 2005, Junio C Hamano wrote: > > LT> I'm also somewhat surprised by the fact that it even seems > LT> to be usable on the kernel tree: > > Surprised? Correctness-wise and/or performance-wise? Performance-wise. It seems to be quite usable, even doing just a plain "git-whatchanged -C" on the kernel with no limits on what it does. Now, all of the actual test-cases I looked at were actually parts of patches where the source file _had_ been modified, so I didn't see a case where it selected any of the random 17,000 files that were _not_ modified, and I didn't double-check further than your commit message saying that it really does that, so.. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 18:27 ` Linus Torvalds @ 2005-05-21 18:34 ` Linus Torvalds 2005-05-21 18:45 ` Linus Torvalds 2005-05-21 19:54 ` [PATCH 3/3] Diff overhaul, adding the other half Junio C Hamano 1 sibling, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2005-05-21 18:34 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 21 May 2005, Linus Torvalds wrote: > > Now, all of the actual test-cases I looked at were actually parts of > patches where the source file _had_ been modified, so I didn't see a case > where it selected any of the random 17,000 files that were _not_ modified, > and I didn't double-check further than your commit message saying that it > really does that, so.. Oh, I decided to double-check, and no, it doesn't actually do a full copy check for diff-tree. Only for diff-cache and diff-files. Which is a sensible default, and I note that you sent a separate email for testing the extreme case. I'll try that out too, just for fun, Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 18:34 ` Linus Torvalds @ 2005-05-21 18:45 ` Linus Torvalds 2005-05-21 20:10 ` Junio C Hamano 2005-05-24 5:37 ` Junio C Hamano 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-21 18:45 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 21 May 2005, Linus Torvalds wrote: > > Which is a sensible default, and I note that you sent a separate email for > testing the extreme case. I'll try that out too, just for fun, Hmm.. It's not working well. Not only does it take a lot of CPU time (do an fsck first to make sure you're not seekign the disk all over the place), but it "finds" lots of things like this: diff --git a/drivers/usb/misc/emi62_fw_m.h b/drivers/media/dvb/bt8xx/dst_ca.h similarity index 99% copy from drivers/usb/misc/emi62_fw_m.h copy to drivers/media/dvb/bt8xx/dst_ca.h --- a/drivers/usb/misc/emi62_fw_m.h +++ b/drivers/media/dvb/bt8xx/dst_ca.h @@ -1,8853 +1,58 @@ /* - * This file is generated from three different files, provided by Emagic. - */ -/* generated Tue Jun 3 21:36:11 EEST 2003 */ + CA-driver for TwinHan DST Frontend/Card which looks quite bogus (it they aren't similar at all, and the diff is huge). I think that your similarity check has a tendency to do bad things if one of the files is huge: in this case we have torvalds@ppc970:~/v2.6/linux> wc -c drivers/media/dvb/bt8xx/dst_ca.h drivers/usb/misc/emi62_fw_m.h 1591 drivers/media/dvb/bt8xx/dst_ca.h 795679 drivers/usb/misc/emi62_fw_m.h 797270 total and you consider them "similar", probably because it turns out that a delta that just removes everything is very small (it's just a "delete bytes x-y") so you compare that "small" delta to a "large total file" and you think it's an almost perfect match. Now, for _renames_ that is actually half-way the right thing to do, but for copies, you should compare the size not to the _sum_ of the two files, but to just the size of the file that you generate. But it's a fun example ;) Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 18:45 ` Linus Torvalds @ 2005-05-21 20:10 ` Junio C Hamano 2005-05-24 5:37 ` Junio C Hamano 1 sibling, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 20:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> Hmm.. It's not working well. Not only does it take a lot of LT> CPU time... LT> torvalds@ppc970:~/v2.6/linux> wc -c drivers/media/dvb/bt8xx/dst_ca.h drivers/usb/misc/emi62_fw_m.h LT> 1591 drivers/media/dvb/bt8xx/dst_ca.h LT> 795679 drivers/usb/misc/emi62_fw_m.h LT> 797270 total Yup. diffcore-rename bases the similarity index on the average of src and dst files. Maybe the similarity check should also say they should be comparable size, like this upfront before even going into delta: abs(src-dst) < avg(src,dst) * (1 - minscore/maxscore) ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 18:45 ` Linus Torvalds 2005-05-21 20:10 ` Junio C Hamano @ 2005-05-24 5:37 ` Junio C Hamano 2005-05-24 6:19 ` Linus Torvalds 1 sibling, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-24 5:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> Hmm.. It's not working well. Not only does it take a lot of CPU time (do LT> an fsck first to make sure you're not seekign the disk all over the LT> place), but it "finds" lots of things like this: The "finds logs of funny things" problem should have been fixed by now, and I am fixing a big screwup now as I reported. I have two ideas on speeding up diff-tree -C I want to run by you. I have not measured things yet, but I think the big CPU waste is coming from either expanding all the blobs and/or running the diff-delta on many file pairs. If that is indeed the cause, then helping the upfront check in the similarity estimator that refuses to consider a file pair whose file size change is too big may be a good way to resolve this problem. One approach, which I think is an unacceptable change at this stage (but I would seriously consider if this _were_ a week and half old project), is to record the blob size as part of the object ID. We say object size is "unsigned long" everywhere, so I am talking about making the object ID from 20-byte SHA1 to 24-byte SHA1 plus 4-byte integer in the network byte order. Since I think the above is inpractical, the second best approach would be to piggy-back on the optimization used in uncached diff-cache, which avoids blob expansion if cache says what we have in the work tree already matches the object we are interested in. When -C is in effect, we would make diff-tree read the current cache first, so that diff_populate_filespec() can borrow from the current work tree when a path in the tree we are looking at has not changed. This would obviously be effective only when we are talking about recent history. Thoughts? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-24 5:37 ` Junio C Hamano @ 2005-05-24 6:19 ` Linus Torvalds 2005-05-24 8:16 ` Junio C Hamano 2005-05-26 3:17 ` [RFC/PATCH] Detect copies harder in diff-tree Junio C Hamano 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2005-05-24 6:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Mon, 23 May 2005, Junio C Hamano wrote: > > I have not measured things yet, but I think the big CPU waste is > coming from either expanding all the blobs and/or running the > diff-delta on many file pairs. If that is indeed the cause, > then helping the upfront check in the similarity estimator that > refuses to consider a file pair whose file size change is too > big may be a good way to resolve this problem. Since pretty much all the blobs will be expanded in the working directory anyway, it sounds like that would be the way to go. > One approach, which I think is an unacceptable change at this > stage (but I would seriously consider if this _were_ a week and > half old project), is to record the blob size as part of the > object ID. We say object size is "unsigned long" everywhere, so > I am talking about making the object ID from 20-byte SHA1 to > 24-byte SHA1 plus 4-byte integer in the network byte order. You can actually get the blob size fairly easily for non-delta objects, by just unpacking the beginning of it. But since we have the files.. That said, I don't think -C is that important. I personally don't see it as a thing I'd run normally - it's more of a thing I might do between releases rather than for something like git-whatchanged that looks at every commit. It's an interesting thing to have _available_, but I don't think it's a huge problem if it is a lot more expensive than the more normal "-M". Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-24 6:19 ` Linus Torvalds @ 2005-05-24 8:16 ` Junio C Hamano 2005-05-24 8:31 ` Linus Torvalds 2005-05-26 3:17 ` [RFC/PATCH] Detect copies harder in diff-tree Junio C Hamano 1 sibling, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2005-05-24 8:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> Since pretty much all the blobs will be expanded in the working directory LT> anyway, it sounds like that would be the way to go. LT> That said, I don't think -C is that important... OK, so the short version is, diff-cache like optimization may be interesting to try out, but practically it would not be much useful anyway, so I should do it if I am really bored and have nothing else interesting to do ;-). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-24 8:16 ` Junio C Hamano @ 2005-05-24 8:31 ` Linus Torvalds 2005-05-24 9:05 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2005-05-24 8:31 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Tue, 24 May 2005, Junio C Hamano wrote: > > OK, so the short version is, diff-cache like optimization may be > interesting to try out, but practically it would not be much > useful anyway, so I should do it if I am really bored and have > nothing else interesting to do ;-). Yup. I think it's more important to get the rest calmed down again, and fix the things that got broken. Sadly, "git-whatchanged -s" was one such thing. (I think that's just because the "silent" test used to depend on the magical behaviour of the "header" thing, and now that the header generation and suppression is sane, "silent" doesn't work any more) Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-24 8:31 ` Linus Torvalds @ 2005-05-24 9:05 ` Junio C Hamano 0 siblings, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-24 9:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> (I think that's just because the "silent" test used to depend on the LT> magical behaviour of the "header" thing, and now that the header LT> generation and suppression is sane, "silent" doesn't work any more) I think you will be more efficient for this task but I'm willing to volunteer if you let me know how "silent" should behave. The documentation says it is useful only with -v and supresses the diffs, so if that is the only thing it does, I think something like this is sufficient? Not tested enough but I am going to crash for the day now. ------------ Use DIFF_FORMAT_NO_OUTPUT to implement diff-tree -s option. Instead of checking silent flag all over the place, simply use the NO_OUTPUT option diffcore provides to suppress the diff output. Signed-off-by: Junio C Hamano <junkio@cox.net> --- cd /opt/packrat/playpen/public/in-place/git/git.junio/ jit-diff # - HEAD: Allow symlinks in the leading path in checkout-cache --prefix= # + (working tree) diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -2,7 +2,6 @@ #include "cache.h" #include "diff.h" -static int silent = 0; static int show_root_diff = 0; static int verbose_header = 0; static int ignore_merges = 1; @@ -67,9 +66,6 @@ static void show_file(const char *prefix const char *path; const unsigned char *sha1 = extract(tree, size, &path, &mode); - if (silent) - return; - if (recursive && S_ISDIR(mode)) { char type[20]; unsigned long size; @@ -132,9 +128,6 @@ static int compare_tree_entry(void *tree return retval; } - if (silent) - return 0; - diff_change(mode1, mode2, sha1, sha2, base, path1); return 0; } @@ -395,8 +388,7 @@ static char *generate_header(const char if (this_header[offset-1] != '\n') this_header[offset++] = '\n'; /* Add _another_ EOLN if we are doing diff output */ - if (!silent) - this_header[offset++] = '\n'; + this_header[offset++] = '\n'; this_header[offset] = 0; } @@ -442,8 +434,6 @@ static int diff_tree_commit(const unsign * Don't print multiple merge entries if we * don't print the diffs. */ - if (silent) - break; } offset += 48; } @@ -540,7 +530,7 @@ int main(int argc, const char **argv) continue; } if (!strcmp(arg, "-s")) { - silent = 1; + diff_output_format = DIFF_FORMAT_NO_OUTPUT; continue; } if (!strcmp(arg, "-v")) { Compilation finished at Tue May 24 02:03:10 ^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC/PATCH] Detect copies harder in diff-tree. 2005-05-24 6:19 ` Linus Torvalds 2005-05-24 8:16 ` Junio C Hamano @ 2005-05-26 3:17 ` Junio C Hamano 1 sibling, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-26 3:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> That said, I don't think -C is that important. By now, you know I won't listen ;-). I've done preliminary --detect-copies-harder (that is to feed all the unmodified files to diffcore when doing -C) and --use-cache (this is to allow diffcore to avoid expanding blob when there is a already matching file in the work tree) changes to diff-tree. This example is from the linux-2.6 tree, with the tip of the tree in the work tree and the cache, and looking at the commit when include/asm-um was cleaned up (May 5th). The first one does not detect copies from "unmodified" files, but the latter two do. In this commit there isn't any copy that "harder" version finds but ordinary one doesn't. : siamese; time ../git.junio/git-diff-tree -r \ -C dbc35cc73f2edd6e39d7e814dbb6eddad6294665 >/dev/null real 0m0.010s user 0m0.010s sys 0m0.000s : siamese; time ../git.junio/git-diff-tree -r --detect-copies-harder \ -C dbc35cc73f2edd6e39d7e814dbb6eddad6294665 >/dev/null real 0m19.938s user 0m11.520s sys 0m1.240s : siamese; time ../git.junio/git-diff-tree -r \ --detect-copies-harder --use-cache -C \ dbc35cc73f2edd6e39d7e814dbb6eddad6294665 >/dev/null real 0m5.858s user 0m5.110s sys 0m0.710s ------------ Add --detect-copies-harder and --use-cache to diff-tree. This adds two new options to diff-tree. Even when -C is used, diff-tree does not normally feed "unmodified" filepair to the diffcore, so copy detection is done only among the files that have changed. With --detect-copies-harder, it can also detect copies made from an unmodified file (this behavior is the default for diff-files and diff-cache). When this option is used, it is recommended to also give --use-cache, which lets diffcore to avoid expanding blob when the work tree has the same file unmodified. Signed-off-by: Junio C Hamano <junkio@cox.net> --- cd /opt/packrat/playpen/public/in-place/git/git.junio/ jit-diff # - linus: git-rev-list: add "end" commit and "--header" flag # + (working tree) diff --git a/diff-tree.c b/diff-tree.c --- a/diff-tree.c +++ b/diff-tree.c @@ -6,6 +6,8 @@ static int show_root_diff = 0; static int verbose_header = 0; static int ignore_merges = 1; static int recursive = 0; +static int use_cache = 0; +static int detect_copies_harder = 0; static int show_tree_entry_in_recursive = 0; static int read_stdin = 0; static int diff_output_format = DIFF_FORMAT_HUMAN; @@ -108,7 +110,8 @@ static int compare_tree_entry(void *tree show_file("+", tree2, size2, base); return 1; } - if (!memcmp(sha1, sha2, 20) && mode1 == mode2) + if (!memcmp(sha1, sha2, 20) && mode1 == mode2 && + (detect_rename != DIFF_DETECT_COPY || !detect_copies_harder)) return 0; /* @@ -549,6 +552,14 @@ int main(int argc, const char **argv) read_stdin = 1; continue; } + if (!strcmp(arg, "--use-cache")) { + use_cache = 1; + continue; + } + if (!strcmp(arg, "--detect-copies-harder")) { + detect_copies_harder = 1; + continue; + } if (!strcmp(arg, "--root")) { show_root_diff = 1; continue; @@ -566,6 +577,16 @@ int main(int argc, const char **argv) pathlens[i] = strlen(paths[i]); } + if (detect_rename && use_cache && !active_cache) { + /* read-cache does not die even when it fails + * so it is safe for us to do this here. Also + * it does not smudge active_cache or active_nr + * when it fails, so we do not have to worry about + * cleaning it up oufselves either. + */ + read_cache(); + } + switch (nr_sha1) { case 0: if (!read_stdin) Compilation finished at Wed May 25 19:59:09 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 3/3] Diff overhaul, adding the other half... 2005-05-21 18:27 ` Linus Torvalds 2005-05-21 18:34 ` Linus Torvalds @ 2005-05-21 19:54 ` Junio C Hamano 1 sibling, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 19:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> On Sat, 21 May 2005, Junio C Hamano wrote: >> LT> I'm also somewhat surprised by the fact that it even seems LT> to be usable on the kernel tree: >> >> Surprised? Correctness-wise and/or performance-wise? LT> Performance-wise. It seems to be quite usable, even doing just a plain LT> "git-whatchanged -C" on the kernel with no limits on what it does. That is probably because you do not have the feeding of "same" in diff-tree. ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH] Tweak diffcore-rename heuristics. 2005-05-21 9:42 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Junio C Hamano 2005-05-21 10:11 ` [PATCH] Teach diff-tree to report unmodified paths for -C option Junio C Hamano 2005-05-21 17:25 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Linus Torvalds @ 2005-05-21 22:55 ` Junio C Hamano [not found] ` <Pine.LNX.4.58.0505211004400.2206@ppc970.osdl.org> 3 siblings, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-21 22:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: git The heuristics so far was to compare file size change and xdelta size against the average of file size before and after the change. This patch uses the smaller of pre- and post- change file size instead. It also makes a very small performance fix. I didn't measure it; I do not expect it to make any practical difference, but while scanning an already sorted list, breaking out in the middle is the right thing. Signed-off-by: Junio C Hamano <junkio@cox.net> --- diffcore-rename.c | 38 ++++++++++++++++++++------------------ 1 files changed, 20 insertions(+), 18 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -68,37 +68,39 @@ static int estimate_similarity(struct di * else. */ void *delta; - unsigned long delta_size; + unsigned long delta_size, base_size; int score; delta_size = ((src->size < dst->size) ? (dst->size - src->size) : (src->size - dst->size)); + base_size = ((src->size < dst->size) ? src->size : dst->size); - /* We would not consider rename followed by more than - * minimum_score/MAX_SCORE edits; that is, delta_size must be smaller - * than (src->size + dst->size)/2 * minimum_score/MAX_SCORE, - * which means... + /* We would not consider edits that change the file size so + * drastically. delta_size must be smaller than + * minimum_score/MAX_SCORE * min(src->size, dst->size). + * Note that base_size == 0 case is handled here already + * and the final score computation below would not have a + * divide-by-zero issue. */ - - if ((src->size+dst->size)*minimum_score < delta_size*MAX_SCORE*2) + if (base_size * minimum_score < delta_size * MAX_SCORE) return 0; delta = diff_delta(src->data, src->size, dst->data, dst->size, &delta_size); + /* + * We currently punt here, but we may later end up parsing the + * delta to really assess the extent of damage. A big consecutive + * remove would produce small delta_size that affects quite a + * big portion of the file. + */ free(delta); - /* This "delta" is really xdiff with adler32 and all the - * overheads but it is a quick and dirty approximation. - * - * Now we will give some score to it. 100% edit gets - * 0 points and 0% edit gets MAX_SCORE points. That is, every - * 1/MAX_SCORE edit gets 1 point penalty. The amount of penalty is: - * - * (delta_size * 2 / (src->size + dst->size)) * MAX_SCORE - * + /* + * Now we will give some score to it. 100% edit gets 0 points + * and 0% edit gets MAX_SCORE points. */ - score = MAX_SCORE-(MAX_SCORE*2*delta_size/(src->size+dst->size)); + score = MAX_SCORE - (MAX_SCORE * delta_size / base_size); if (score < 0) return 0; if (MAX_SCORE < score) return MAX_SCORE; return score; @@ -314,7 +316,7 @@ void diff_detect_rename(struct diff_queu if (mx[i].dst->xfrm_flags & RENAME_DST_MATCHED) continue; /* alreayd done, either exact or fuzzy. */ if (mx[i].score < minimum_score) - continue; + break; /* there is not any more diffs applicable. */ record_rename_pair(&outq, mx[i].src, mx[i].dst, mx[i].rank, mx[i].score); ------------------------------------------------ ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <Pine.LNX.4.58.0505211004400.2206@ppc970.osdl.org>]
[parent not found: <7v4qcwihu1.fsf@assigned-by-dhcp.cox.net>]
* Now I think I am done with diff... [not found] ` <7v4qcwihu1.fsf@assigned-by-dhcp.cox.net> @ 2005-05-23 6:50 ` Junio C Hamano 0 siblings, 0 replies; 43+ messages in thread From: Junio C Hamano @ 2005-05-23 6:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "I" == Junio C Hamano <junkio@cox.net> said: JCH> Now I think I am done with diff,... I think I have fixed the diff-raw output routine enough to address the rename and copy distinction issues (although I chose not to give individual entries "this is a copy" or "this is a rename" indicators). And I thought I was really done with diff this time, but here are some leftover issues I would like to further address, not necessarily in this order, and not necessarily promising to do all of them. - Make "diff-raw human readable" output quite different from "diff-raw machine readable" format. I think human readable format should be more like "concise version of diff-patch format for human consumption", not the current "LF and TAB separated version of machine readable output", to enhance readability. This includes: (1) not repeating paths when src and dst are the same (i.e. most of the time); (2) substituting src/dst with /dev/null when talking about creation or deletion; (3) prefixing the src and dst with '+' or '*' or '-'. Currently I do not intend to add _parsing_ of such output to diff-helper because you can always use -z on both ends of the pipe (and as yourself said "Sure. Although I doubt people use the raw diff output except to (a) feed to diff-helper or (b) check that it's non-empty."), but I have to sleep on this. - Add an option to cause diff-tree three-brothers not to call diffcore_prune() at all. This will essentially produce something like a merge of two output from ls-files or ls-trees commands, but with renames/copies resolved. I do not know how useful this would be, especially now the immediate problem of diff_flush() pruning all the "non-change" entries is (hopefully) fixed. - Give pickaxe an option to cause it to produce diffs for the entire commit, not just the found file. - Experiment with the patch reordering I talked about in the previous message. - Add the support of "copied from post-edit image" detection in rename/copy logic. Currently I compare only with pre-edit image for modified files. - Read and understand the patch-delta.c, to make rename/copy similarity estimator more accurate. As you pointed out, it does very bad things on big consecutive deletes. - Think about performance issues around "diff-tree -C" stuff. - Think more about the problem with the patch output between two symbolic links, and how to fix it (I've been quietly thinking about this for some time ;-)). I think the current output of just comparing the two readlink results and generating a normal diff between them is broken and hazardous if such a thing is mistakenly fed to patch -p1. - Teach JIT about the diff-raw format changes. ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2005-05-26 3:15 UTC | newest] Thread overview: 43+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-05-19 10:32 [PATCH] Detect renames in diff family Junio C Hamano 2005-05-19 16:19 ` Linus Torvalds 2005-05-19 17:13 ` Junio C Hamano 2005-05-19 17:31 ` Linus Torvalds 2005-05-19 18:29 ` Nicolas Pitre 2005-05-19 18:46 ` Junio C Hamano 2005-05-19 18:58 ` Nicolas Pitre 2005-05-19 20:36 ` Junio C Hamano 2005-05-19 20:48 ` Nicolas Pitre 2005-05-19 21:44 ` Junio C Hamano 2005-05-19 22:26 ` Linus Torvalds 2005-05-19 23:32 ` Junio C Hamano 2005-05-20 0:10 ` Linus Torvalds 2005-05-20 0:48 ` Junio C Hamano 2005-05-19 18:47 ` Junio C Hamano 2005-05-19 18:42 ` Junio C Hamano 2005-05-19 18:55 ` Linus Torvalds 2005-05-19 19:53 ` Junio C Hamano 2005-05-19 20:12 ` H. Peter Anvin 2005-05-19 17:46 ` Joel Becker 2005-05-21 9:37 ` Junio C Hamano 2005-05-21 9:39 ` [PATCH 1/3] Diff overhaul, adding half of copy detection Junio C Hamano 2005-05-21 9:40 ` [PATCH 2/3] Introducing software archaeologist's tool "pickaxe" Junio C Hamano 2005-05-21 22:02 ` [PATCH] Constness fix for pickaxe option Junio C Hamano 2005-05-21 22:16 ` Linus Torvalds 2005-05-21 9:42 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Junio C Hamano 2005-05-21 10:11 ` [PATCH] Teach diff-tree to report unmodified paths for -C option Junio C Hamano 2005-05-21 10:51 ` Junio C Hamano 2005-05-21 17:25 ` [PATCH 3/3] Diff overhaul, adding the other half of copy detection Linus Torvalds 2005-05-21 18:10 ` [PATCH 3/3] Diff overhaul, adding the other half Junio C Hamano 2005-05-21 18:27 ` Linus Torvalds 2005-05-21 18:34 ` Linus Torvalds 2005-05-21 18:45 ` Linus Torvalds 2005-05-21 20:10 ` Junio C Hamano 2005-05-24 5:37 ` Junio C Hamano 2005-05-24 6:19 ` Linus Torvalds 2005-05-24 8:16 ` Junio C Hamano 2005-05-24 8:31 ` Linus Torvalds 2005-05-24 9:05 ` Junio C Hamano 2005-05-26 3:17 ` [RFC/PATCH] Detect copies harder in diff-tree Junio C Hamano 2005-05-21 19:54 ` [PATCH 3/3] Diff overhaul, adding the other half Junio C Hamano 2005-05-21 22:55 ` [PATCH] Tweak diffcore-rename heuristics Junio C Hamano [not found] ` <Pine.LNX.4.58.0505211004400.2206@ppc970.osdl.org> [not found] ` <7v4qcwihu1.fsf@assigned-by-dhcp.cox.net> 2005-05-23 6:50 ` Now I think I am done with diff Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).