From: Stefan Beller <sbeller@google.com>
To: sbeller@google.com
Cc: git@vger.kernel.org, gitster@pobox.com, jrnieder@gmail.com
Subject: [PATCH] Documentation/diff-options: explain different diff algorithms
Date: Fri, 10 Aug 2018 15:18:57 -0700
Message-ID: <20180810221857.87399-1-sbeller@google.com>
In-Reply-To: <CAGZ79kZR_gj00JORH3WB_T+_mgtQm5PGt6+DSMFUbJM+C4FxVw@mail.gmail.com>

As a user I wondered what the diff algorithms are about. Offer at least
a basic explanation of the differences between the diff algorithms.

Signed-off-by: Stefan Beller <sbeller@google.com>
---

 Not sure if this is finished, I just want to put out the state that I
 have sitting on my disk.
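
 For readers who want to play with the edit-graph picture from the
 DIFF ALGORITHMS text in the patch, here is a minimal Python sketch of
 the basic greedy search from the Myers paper. It only computes the cost
 of the cheapest path from S to F (the number of inserted plus deleted
 symbols); it is not Git's xdiff implementation, has none of its
 heuristics, and the function name is made up for illustration.

```python
def myers_edit_distance(a, b):
    """Cost of the cheapest S-to-F path: horizontal and vertical steps
    cost one, diagonal steps over matching symbols cost zero."""
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k (k = x - y).
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Decide whether to arrive from diagonal k+1 (vertical step)
            # or from diagonal k-1 (horizontal step).
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow free diagonal steps for as long as symbols match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m


# The two sequences from the example table in the patch.
print(myers_edit_distance("ABCA", "ADB"))
```

 The sequences can just as well be lists of lines, which is closer to
 how Git uses the algorithm.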

 Documentation/diff-options.txt | 10 +++--
 Documentation/git-diff.txt     | 72 ++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 3 deletions(-)
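
 The two meanings of 'LCS' that the new section distinguishes can be
 compared with a small Python sketch. Both functions are plain textbook
 dynamic programming and only return lengths; they are illustrations
 with hypothetical names, not what `patience` or `histogram` actually
 run inside Git.

```python
def lcs_subsequence_length(a, b):
    """Longest common subsequence: symbols in order, gaps allowed."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_substring_length(a, b):
    """Longest common substring: symbols in order and adjacent."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            # Length of the common run ending at this pair of symbols.
            run = prev[j - 1] + 1 if x == y else 0
            cur.append(run)
            best = max(best, run)
        prev = cur
    return best


print(lcs_subsequence_length("ABCBDAB", "BDCABA"))
print(lcs_substring_length("ABCBDAB", "BDCABA"))
```

 The subsequence is never shorter than the substring, since every common
 substring is also a common subsequence.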

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index f394608b42c..00684b8936f 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience
 diff" algorithm internally.
 
 --diff-algorithm={patience|minimal|histogram|myers}::
-	Choose a diff algorithm. The variants are as follows:
+	Choose a diff algorithm. See the DIFF ALGORITHMS section
+ifndef::git-diff[]
+	in linkgit:git-diff[1]
+endif::git-diff[]
+	for more discussion. The variants are as follows:
 +
 --
 `default`, `myers`;;
 	The basic greedy diff algorithm. Currently, this is the default.
 `minimal`;;
-	Spend extra time to make sure the smallest possible diff is
-	produced.
+	The same algorithm as `myers`, but spend extra time to make
+	sure the smallest possible diff is produced.
 `patience`;;
 	Use "patience diff" algorithm when generating patches.
 `histogram`;;
diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
index b180f1fa5bf..8837492ed05 100644
--- a/Documentation/git-diff.txt
+++ b/Documentation/git-diff.txt
@@ -119,6 +119,78 @@ include::diff-options.txt[]
 
 include::diff-format.txt[]
 
+DIFF ALGORITHMS
+---------------
+
+This section explains background on the diff algorithms. All of them
+operate on two input sequences of symbols. In Git each symbol is
+represented by a line of a file unless the option to diff based on
+words is given. The following diff algorithms are available:
+
+`Myers`
+
+A diff as produced by the basic greedy algorithm described in
+link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations],
+with an expected run time of O(M + N + D^2), where M and N are the input
+lengths and D is the size of the smallest diff. To understand this algorithm,
+imagine a table spanned by the two input sequences, with a diagonal in each cell
+where the symbols match. For the sequences 'ABCA' and 'ADB' the table looks like
+
+	S | A | B | C | A
+	---------------------
+	A | \ |   |   | \ |
+	---------------------
+	D |   |   |   |   |
+	---------------------
+	B |   | \ |   |   |
+	---------------------
+	  |   |   |   |   |F
+
+and a greedy algorithm is used to find the cheapest path from start S to
+finish F, with each horizontal and vertical step having a cost of one and
+each diagonal step having a cost of zero.
+
+This description is simplified: the real algorithm needs only O(N+M) memory.
+In addition, it employs a heuristic that produces a diff faster at the cost
+of a slightly larger diff. The `minimal` algorithm has that heuristic turned off.
+
+`Minimal`
+The exact algorithm as described in the Myers paper, without the heuristic
+that trades a slightly larger diff for a shorter execution time.
+
+`Patience`
+
+This algorithm, written by Bram Cohen originally for the bzr version control
+system, matches the longest common subsequence of unique lines on
+both sides, recursively. It got its name from the way the longest
+subsequence is found, as that is a byproduct of the patience sorting
+algorithm. If there are no unique lines left, it falls back to `myers`.
+Empirically, this algorithm produces more readable output for code,
+but it does not guarantee the shortest output.
+
+`Histogram`
+
+This algorithm by Shawn Pearce, originally implemented for
+JGit, finds the longest common substring and recursively
+diffs the content before and after the longest common substring.
+If there are no common substrings left, it falls back to `myers`.
+This is often the fastest, but in corner cases (when there are
+many common substrings of the same length) it produces unexpected
+results, as seen in:
+
+	seq 1 100 >one
+	echo 99 > two
+	seq 1 2 98 >>two
+	git diff --no-index --histogram one two
+
+
+Note that `patience` and `histogram` each rely on a concept abbreviated
+as 'LCS': longest common subsequence and longest common substring, respectively.
+The longest common subsequence is the longest sequence of symbols found
+on both sides in the same order. The symbols do not need to be adjacent.
+The longest common substring is the longest sequence of adjacent symbols that
+appears in order on both sides.
+
 EXAMPLES
 --------
 
-- 
2.18.0.865.gffc8e1a3cd6-goog


Thread overview: 13+ messages
2018-07-24  0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller
2018-07-24  4:40 ` Jonathan Nieder
2018-07-24 17:38   ` Stefan Beller
2018-07-24 20:06     ` Junio C Hamano
2018-08-06 22:25   ` Stefan Beller
2018-08-06 23:18     ` Jonathan Nieder
2018-08-07 15:56       ` Junio C Hamano
2018-08-09 19:26         ` Stefan Beller
2018-08-10 22:18           ` Stefan Beller [this message]
2018-08-09 19:51       ` Stefan Beller
2018-08-10  0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
2018-08-10  0:10   ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller
2018-08-10  0:10   ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller
