All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Elijah Newren <newren@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Elijah Newren <newren@gmail.com>
Subject: [PATCH v2 0/4] Optimization batch 7: use file basenames to guide rename detection
Date: Tue, 09 Feb 2021 11:32:02 +0000	[thread overview]
Message-ID: <pull.843.v2.git.1612870326.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.843.git.1612651937.gitgitgadget@gmail.com>

This series depends on ort-perf-batch-6[1].

This series uses file basenames in a basic fashion to guide rename
detection.

Changes since v1:

 * Significant edits to the commit messages to make them explain the basic
   idea better and (hopefully) prevent misunderstandings.
 * Add a commit at the end that updates the documentation and includes a new
   testcase
 * Modify the code to make it clearer that it can already handle using a
   different score for basename comparison similarity, even if we don't use
   that ability yet (see below)

Changes not yet included -- I need input about what is wanted:

 * Stolee suggested not creating a separate score for basename comparisons,
   at least not yet. Junio suggested it may be prudent to use a higher score
   for that than whatever -M option the user provided for normal
   comparisons...but didn't suggest whether it should be a separate
   user-specified option, or some kind of weighted average of the -M option
   and MAX_SCORE (e.g. use 60% if -M is 50%, or use 80% if -M is 75%). I
   tweaked the code to make it clearer that it already is able to handle
   such a score difference, but I'm not sure what whether we want an
   automatically computed higher value or a user-controlled possibly higher
   value.

[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/ [2]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (4):
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection

 Documentation/gitdiffcore.txt |  15 +++
 diffcore-rename.c             | 185 +++++++++++++++++++++++++++++++++-
 t/t4001-diff-rename.sh        |  24 +++++
 3 files changed, 220 insertions(+), 4 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v1:

 1:  377a4a39fa86 ! 1:  381a45d239bb diffcore-rename: compute basenames of all source and dest candidates
     @@ Commit message
          diffcore-rename: compute basenames of all source and dest candidates
      
          We want to make use of unique basenames to help inform rename detection,
     -    so that more likely pairings can be checked first.  Add a new function,
     +    so that more likely pairings can be checked first.  (src/moduleA/foo.txt
     +    and source/module/A/foo.txt are likely related if there are no other
     +    'foo.txt' files among the deleted and added files.)  Add a new function,
          not yet used, which creates a map of the unique basenames within
          rename_src and another within rename_dst, together with the indices
          within rename_src/rename_dst where those basenames show up.  Non-unique
          basenames still show up in the map, but have an invalid index (-1).
      
     +    This function was inspired by the fact that in real world repositories,
     +    most renames often do not involve a basename change.  Here are some
     +    sample repositories and the percentage of their historical renames (as of
     +    early 2020) that did not involve a basename change:
     +      * linux: 76%
     +      * gcc: 64%
     +      * gecko: 79%
     +      * webkit: 89%
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 2:  5fb4493247ff ! 2:  dcd0175229aa diffcore-rename: complete find_basename_matches()
     @@ Commit message
          sufficiently similar, we record the rename; if not, we include those
          files in the more exhaustive matrix comparison.
      
     +    This means we are adding a set of preliminary additional comparisons,
     +    but for each file we only compare it with at most one other file.  For
     +    example, if there was a include/media/device.h that was deleted and a
     +    src/module/media/device.h that was added, and there were no other
     +    device.h files added or deleted between the commits being compared,
     +    then these two files would be compared in the preliminary step.
     +
     +    This commit does not yet actually employ this new optimization, it
     +    merely adds a function which can be used for this purpose.  The next
     +    commit will do the necessary plumbing to make use of it.
     +
          Note that this optimization might give us different results than without
          the optimization, because it's possible that despite files with the same
          basename being sufficiently similar to be considered a rename, there's
          an even better match between files without the same basename.  I think
     -    that is okay for four reasons: (1) That seems somewhat unlikely in
     -    practice, (2) it's easy to explain to the users what happened if it does
     -    ever occur (or even for them to intuitively figure out), and (3) as the
     -    next patch will show it provides such a large performance boost that
     -    it's worth the tradeoff.  Reason (4) takes a full paragraph to
     +    that is okay for four reasons: (1) it's easy to explain to the users
     +    what happened if it does ever occur (or even for them to intuitively
     +    figure out), (2) as the next patch will show it provides such a large
     +    performance boost that it's worth the tradeoff, and (3) it's somewhat
     +    unlikely that despite having unique matching basenames that other files
     +    serve as better matches.  Reason (4) takes a full paragraph to
          explain...
      
          If the previous three reasons aren't enough, consider what rename
     @@ Commit message
          basename and are sufficiently similar to be considered a rename, mark
          them as such without comparing the two to all other rename candidates.
      
     -    We do not use this heuristic together with either break or copy
     -    detection.  The point of break detection is to say that filename
     -    similarity does not imply file content similarity, and we only want to
     -    know about file content similarity.  The point of copy detection is to
     -    use more resources to check for additional similarities, while this is
     -    an optimization that uses far less resources but which might also result
     -    in finding slightly fewer similarities.  So the idea behind this
     -    optimization goes against both of those features, and will be turned off
     -    for both.
     -
     -    Note that this optimization is not yet effective in any situation, as
     -    the function is still unused.  The next commit will hook it into the
     -    code so that it is used when rename detection is wanted, but neither
     -    copy nor break detection are.
     -
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 3:  1d941c35076e ! 3:  ce2173aa1fb7 diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
      
          Make use of the new find_basename_matches() function added in the last
          two patches, to find renames more rapidly in cases where we can match up
     -    files based on basenames.
     +    files based on basenames.  As a quick reminder (see the last two commit
     +    messages for more details), this means for example that
     +    docs/extensions.txt and docs/config/extensions.txt are considered likely
     +    renames if there are no 'extensions.txt' files elsewhere among the added
     +    and deleted files, and if a similarity check confirms they are similar,
     +    then they are marked as a rename without looking for a better similarity
     +    match among other files.  This is a behavioral change, as covered in
     +    more detail in the previous commit message.
     +
     +    We do not use this heuristic together with either break or copy
     +    detection.  The point of break detection is to say that filename
     +    similarity does not imply file content similarity, and we only want to
     +    know about file content similarity.  The point of copy detection is to
     +    use more resources to check for additional similarities, while this is
     +    an optimization that uses far less resources but which might also result
     +    in finding slightly fewer similarities.  So the idea behind this
     +    optimization goes against both of those features, and will be turned off
     +    for both.
      
          For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
          performance work; instrument with trace2_region_* calls", 2020-10-28),
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		remove_unneeded_paths_from_src(want_copies);
      +		trace2_region_leave("diff", "cull after exact", options->repo);
      +	} else {
     ++		/* Determine minimum score to match basenames */
     ++		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
     ++
      +		/*
      +		 * Cull sources:
      +		 *   - remove ones involved in renames (found via exact match)
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
     -+		rename_count += find_basename_matches(options, minimum_score,
     ++		rename_count += find_basename_matches(options,
     ++						      min_basename_score,
      +						      rename_src_nr);
      +		trace2_region_leave("diff", "basename matches", options->repo);
      +
 -:  ------------ > 4:  a0e75d8cd6bd gitdiffcore doc: mention new preliminary step for rename detection

-- 
gitgitgadget

  parent reply	other threads:[~2021-02-09 11:46 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` Elijah Newren via GitGitGadget [this message]
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.843.v2.git.1612870326.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.