From: Yann Dirson <ydirson@free.fr>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH v4 1/4] Introduce wholesame directory move detection in diffcore.
Date: Thu, 7 Oct 2010 01:13:08 +0200 [thread overview]
Message-ID: <20101006231308.GY4983@home.lan> (raw)
In-Reply-To: <20101005010620.GC9994@burratino>
On Mon, Oct 04, 2010 at 08:06:20PM -0500, Jonathan Nieder wrote:
> > @@ -338,7 +339,8 @@ static int show_modified(struct rev_info *revs,
> >
> > oldmode = old->ce_mode;
> > if (mode == oldmode && !hashcmp(sha1, old->sha1) && !dirty_submodule &&
> > - !DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER))
> > + !DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER) &&
> > + !DIFF_OPT_TST(&revs->diffopt, DETECT_DIR_RENAMES))
> > return 0;
>
> The diff-index part.
>
> Would we also need something like v1.6.4-rc0~45^2 (Avoid "diff-index
> --cached" optimization under --find-copies-harder, 2009-05-22) in
> order to grok unchanged subtrees, or is that taken care of some other
> way?
Note to self: look at that
> [...]
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -11,6 +11,7 @@
> > static struct diff_rename_dst {
> > struct diff_filespec *two;
> > struct diff_filepair *pair;
> > + int i_am_not_single:1; // does not look for a match, only here to be looked at
> > } *rename_dst;
>
> If we're looking for directory renames but not finding-copies-harder,
> then unmodified files, while interesting and deserving of dst entries,
> should not be considered candidates for tracing individual files.
>
> Yes? If so, maybe the less colorful
>
> unsigned rename_target_candidate:1;
>
> would make that clearer
Right - but maybe we can use a separate list for those non-candidates
as I suggested elsewhere, to make things even more clear.
> (and more importantly, the new kind of dst entry should be mentioned
> in the commit log message).
Hm, the commit message is a bit long already, isn't that too much detail ?
> > +static struct diff_rename_dst *locate_rename_dst_dir(struct diff_filespec *dir)
> > +{
> > + /* code mostly duplicated from locate_rename_dst - not sure we
> > + * could merge them efficiently,though
> > + */
> > + int first, last;
> > + int prefixlength = strlen(dir->path);
> > +
> > + first = 0;
> > + last = rename_dst_nr;
> > + while (last > first) {
> > + int next = (last + first) >> 1;
> > + struct diff_rename_dst *dst = &(rename_dst[next]);
> > + int cmp = strncmp(dir->path, dst->two->path, prefixlength);
>
> prefixcmp?
Since prefixlength is constant accross the loop, strncmp spares
comparing every char with \0 - we only do that once. I guess it's
worth keeping.
> > + if (!cmp)
> > + return dst;
> > + if (cmp < 0) {
> > + last = next;
> > + continue;
> > + }
> > + first = next+1;
> > + }
> > + /* not found */
> > + return NULL;
> > +}
>
> Hmm, so this is like locate_rename_dst(..., 0), except it matches
> prefixes. Result is a rename_dst entry within that directory (why
> not spend the time to find the first?).
We just don't need the first. In most cases, finding one entry just
invalidates the possiblity of a bulk move. Then when we want all of
them, we have to scan the list in both directions - and yes, a clean
way to avoid code duplication would be nice for those.
> >
> > /* Find the renames */
> > i = for_each_hash(&file_table, find_same_files);
> > @@ -414,6 +445,180 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
> > m[worst] = *o;
> > }
> >
> > +struct diff_dir_rename {
> > + struct diff_filespec *one;
> > + struct diff_filespec *two;
> > + int discarded;
> > + struct diff_dir_rename* next;
> > +};
>
> A linked list of renamed directories. What functions maintain it?
That is just the type decl. For this patch it is only used from
diffcore_factorize_renames() and diffcore_rename(), but the
--hide-dir-rename-details part needs it for maybe_mark_uninteresting,
which explains its position in the source file.
> [...]
> > +static struct diff_dir_rename* factorization_candidates = NULL;
> > +static void diffcore_factorize_renames(void)
> > +{
> > + char one_parent_path[PATH_MAX], two_parent_path[PATH_MAX];
> > + int i;
> > +
> > + for (i = 0; i < rename_dst_nr; i++) {
>
> For each rename target candidate: ...
>
> Would it make sense to give the body of this loop its own function?
Yes, that makes things more readable.
> > + struct diff_dir_rename* seen;
> > +
> > + // FIXME: what are those ?
> > + if (!rename_dst[i].pair)
> > + continue;
>
> Genuine new files are not interesting to us (since they do not
> provide evidence about directory renames, postive or negative).
Well, maybe except for Junio's idea of using bulk-rename detection to
bump rename scores (we may want to look at that afterwards, but I fear
it will require heavy surgery).
> > + // dummy renames used by copy detection
> > + if (!strcmp(rename_dst[i].pair->one->path, rename_dst[i].pair->two->path))
> > + continue;
>
> Whether a file was copied is probably interesting to us, but the
> corresponding filepair isn't.
Good point: bulk-rename can already be made to direct move+copy
detection to select the correct dst for the move.
> > +
> > + struct diff_filespec* one_parent = alloc_filespec(one_parent_path);
> > + fill_filespec(one_parent, null_sha1 /*FIXME*/, S_IFDIR);
>
> Initialize a filespec, for use in filepairs in case the containing
> directories are a source/target pair. (?)
We need a filespec for locate_rename_dst_dir, but it is silly, we can
just pass it a path string, and initialize the filespec when really
needed. Good catch.
> > +
> > + // After this commit, are there any files still under one_parent_path ?
> > + struct diff_rename_dst* one_leftover = locate_rename_dst_dir(one_parent);
> > + if (one_leftover) { // FIXME: should only be run if !seen
> [...]
> > + }
>
> A candidate for a separate function, no?
Right.
> The idea, if I understand it: too many unchanged files can
> disqualify a rename source.
Not exactly: any file left in a dir disqualifies this dir for bulk-rename.
> > +
> > + // already considered ?
> > + for (seen=factorization_candidates; seen; seen = seen->next)
> > + if (!strcmp(seen->one->path, one_parent_path)) break;
>
> Lookup. A candidate for a separate function, I think.
>
> > + if (!seen) {
> > + // record potential dir rename
> > + seen = xmalloc(sizeof(*seen));
> > + seen->one = one_parent;
> > + seen->two = alloc_filespec(two_parent_path);
> > + fill_filespec(seen->two, null_sha1 /*FIXME*/, S_IFDIR);
> > + seen->discarded = 0;
> > + seen->next = factorization_candidates;
> > + factorization_candidates = seen;
> > + fprintf(stderr, "[DBG] %s -> %s suggests possible rename from %s to %s\n",
> > + rename_dst[i].pair->one->path,
> > + rename_dst[i].pair->two->path,
> > + one_parent_path, two_parent_path);
>
> Insertion. Likewise.
Hm, it's only used once, and once the other splits are done the func
fits on a screen (at least on my display ;)
OTOH, if we decide for a sorted list, to allow for binary search to be
more efficient, then yes.
> > +
> > + /* all checks ok, we keep that entry */
> > + }
> > +
> > + return;
> > +}
> > +
> > void diffcore_rename(struct diff_options *options)
> > {
> > int detect_rename = options->detect_rename;
> > @@ -451,13 +656,22 @@ void diffcore_rename(struct diff_options *options)
> [...]
> > if (detect_rename == DIFF_DETECT_COPY) {
> > /*
> > * Increment the "rename_used" score by
> > * one, to indicate ourselves as a user.
> > */
> > p->one->rename_used++;
> > register_rename_src(p->one, p->score);
> > }
> > + if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES)) {
> > + /* similarly, rename factorization needs to
> > + * see all files from second tree, but we don't
> > + * want them to be matched against single sources.
> > + */
> > + locate_rename_dst(p->two, 1)->i_am_not_single = 1;
>
> ... except when --find-copies-harder is being used, right?
I have not investigated interactions with copy detection yet.
> > @@ -569,7 +785,28 @@ void diffcore_rename(struct diff_options *options)
> > /* At this point, we have found some renames and copies and they
> > * are recorded in rename_dst. The original list is still in *q.
> > */
> > +
> > + /* Now possibly factorize those renames and copies. */
> > + if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES))
> > + diffcore_factorize_renames();
> > +
> > DIFF_QUEUE_CLEAR(&outq);
> > +
> > + // Now turn non-discarded factorization_candidates into real renames
> > + struct diff_dir_rename* candidate;
> > + for (candidate=factorization_candidates; candidate; candidate = candidate->next) {
> > + struct diff_filepair* pair;
> > + if (candidate->discarded) continue;
> > + // visualize toplevel dir if needed - FIXME: wrong place for this ?
> > + if (!*candidate->one->path)
> > + candidate->one->path = "./";
> > + if (!*candidate->two->path)
> > + candidate->two->path = "./";
> > + pair = diff_queue(&outq, candidate->one, candidate->two);
> > + pair->score = MAX_SCORE;
> > + pair->renamed_pair = 1;
>
> Outputting the discovered directory renames.
Not really outputting, just queueing for ouput here.
> Conclusions:
>
> - the basic idea looks sane
> - the main function would benefit a lot from being split up a bit
> - would be nice to have an overview of the design (especially, a
> quick description of the heuristics used) for the commit message
Thanks for this detailed review. Pushing to
http://repo.or.cz/w/git/ydirson.git a set of patches addressing those
remarks and others from this thread.
next prev parent reply other threads:[~2010-10-06 23:03 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-10-03 20:42 [RFC PATCH v4 0/4] Detection of directory renames Yann Dirson
2010-10-03 20:42 ` [PATCH v4 1/4] Introduce wholesame directory move detection in diffcore Yann Dirson
2010-10-03 20:42 ` [PATCH v4 2/4] Add testcases for the --detect-dir-renames diffcore flag Yann Dirson
2010-10-03 20:42 ` [PATCH v4 3/4] Allow hiding renames of individual files involved in a directory rename Yann Dirson
2010-10-03 20:42 ` [PATCH v4 4/4] Add testcases for the --hide-dir-rename-details diffcore flag Yann Dirson
2010-10-03 23:04 ` Sverre Rabbelier
2010-10-03 23:06 ` [PATCH v4 3/4] Allow hiding renames of individual files involved in a directory rename Sverre Rabbelier
2010-10-03 23:28 ` Junio C Hamano
2010-10-04 6:43 ` Sverre Rabbelier
2010-10-04 18:21 ` Yann Dirson
2010-10-04 3:03 ` [PATCH v4 2/4] Add testcases for the --detect-dir-renames diffcore flag Ævar Arnfjörð Bjarmason
2010-10-04 18:32 ` Yann Dirson
2010-10-04 20:32 ` Jonathan Nieder
2010-10-04 21:37 ` Yann Dirson
2010-10-04 22:09 ` Jonathan Nieder
2010-10-05 9:21 ` Andreas Ericsson
2010-10-04 2:59 ` [PATCH v4 1/4] Introduce wholesame directory move detection in diffcore Ævar Arnfjörð Bjarmason
2010-10-04 18:19 ` Yann Dirson
2010-10-04 7:28 ` Jonathan Nieder
2010-10-04 21:13 ` Yann Dirson
2010-10-05 1:06 ` Jonathan Nieder
2010-10-06 23:13 ` Yann Dirson [this message]
2010-10-04 6:20 ` [RFC PATCH v4 0/4] Detection of directory renames Jonathan Nieder
2010-10-05 1:42 ` Jonathan Nieder
2010-10-06 23:17 ` Yann Dirson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101006231308.GY4983@home.lan \
--to=ydirson@free.fr \
--cc=git@vger.kernel.org \
--cc=jrnieder@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.