From: Stefan Beller <sbeller@google.com>
To: sbeller@google.com
Cc: git@vger.kernel.org, gitster@pobox.com, jrnieder@gmail.com
Subject: [PATCH] Documentation/diff-options: explain different diff algorithms
Date: Fri, 10 Aug 2018 15:18:57 -0700 [thread overview]
Message-ID: <20180810221857.87399-1-sbeller@google.com> (raw)
In-Reply-To: <CAGZ79kZR_gj00JORH3WB_T+_mgtQm5PGt6+DSMFUbJM+C4FxVw@mail.gmail.com>
As a user I wondered what the diff algorithms are about. Offer at least
a basic explanation on the differences of the diff algorithms.
Signed-off-by: Stefan Beller <sbeller@google.com>
---
Not sure if this is finished, I just want to put out the state that I
have sitting on my disk.
Documentation/diff-options.txt | 10 +++--
Documentation/git-diff.txt | 72 ++++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+), 3 deletions(-)
diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index f394608b42c..00684b8936f 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience
diff" algorithm internally.
--diff-algorithm={patience|minimal|histogram|myers}::
- Choose a diff algorithm. The variants are as follows:
+ Choose a diff algorithm. See the DIFF ALGORITHMS section
+ifndef::git-diff[]
+ in linkgit:git-diff[1]
+endif::git-diff[]
+ for more discussion. The variants are as follows:
+
--
`default`, `myers`;;
The basic greedy diff algorithm. Currently, this is the default.
`minimal`;;
- Spend extra time to make sure the smallest possible diff is
- produced.
+ The same algorithm as `myers`, but spend extra time to make
+ sure the smallest possible diff is produced.
`patience`;;
Use "patience diff" algorithm when generating patches.
`histogram`;;
diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
index b180f1fa5bf..8837492ed05 100644
--- a/Documentation/git-diff.txt
+++ b/Documentation/git-diff.txt
@@ -119,6 +119,78 @@ include::diff-options.txt[]
include::diff-format.txt[]
+DIFF ALGORITHMS
+---------------
+
+This section explains background on the diff algorithms. All of them
+operate on two input sequences of symbols. In Git each symbol is
+represented by a line of a file unless the option to diff based on
+words is given. The following diff algorithms are available:
+
+`Myers`
+
+A diff as produced by the basic greedy algorithm described in
+link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations].
+with a run time of O(M + N + D^2). To understand this algorithm, one
+can imagine a table spanned by the two input sequences with slides
+where there are the same symbols. For example the sequences 'ABCD' and 'ADB'
+the graph would look like
+
+ S | A | B | C | A
+ ---------------------
+ A | \ | | | \ |
+ ---------------------
+ D | | | | |
+ ---------------------
+ B | | \ | | |
+ ---------------------
+ | | | | |F
+
+and a greedy algorithm is used to find the cheapest path from start S to
+finish F, with each horizontal and vertical step having a cost of one and
+the diagonal slides having a cost of zero.
+
+This is simplified as the real algorithm only needs O(N+M) in terms of memory.
+In addition it employs a heuristic to allow for a faster diff at the small
+cost of diff size. The `minimal` algorithm has that heuristic turned off.
+
+`Minimal`
+The exact algorithm as described in the `Myers` paper without the heuristic
+that trades execution time for slightly worse diffs.
+
+`Patience`
+
+This algorithm by Bram Cohen originally for the bzr version control
+system matches the longest common subsequence of unique lines on
+both sides, recursively. It obtained its name by the way the longest
+subsequence is found, as that is a byproduct of the patience sorting
+algorithm. If there are no unique lines left it falls back to `myers`.
+Empirically this algorithm produces a more readable output for code,
+but it does not guarantee the shortest output.
+
+`Histogram`
+
+This algorithm by Shawn Pearce, originally implemented for
+JGit, finds the longest common substring and recursively
+diffs the content before and after the longest common substring.
+If there are no common substrings left, fall back to `myers`.
+This is often the fastest, but in corner cases (when there are
+many common substrings of the same length) it produces unexpected
+results as seen in:
+
+ seq 1 100 >one
+ echo 99 > two
+ seq 1 2 98 >>two
+ git diff --no-index --histogram one two
+
+
+Note how both `patience` and `histogram` use a concept that is abbreviated
+as 'LCS' (longest common subsequence and longest common substring).
+The longest common subsequence is a sequence of symbols that are found
+on both sides in the same order. The symbols do not need to be adjacent.
+The longest common substring is a sequence of adjacent symbols in order
+on both sides.
+
EXAMPLES
--------
--
2.18.0.865.gffc8e1a3cd6-goog
next prev parent reply other threads:[~2018-08-10 22:19 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-07-24 0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller
2018-07-24 4:40 ` Jonathan Nieder
2018-07-24 17:38 ` Stefan Beller
2018-07-24 20:06 ` Junio C Hamano
2018-08-06 22:25 ` Stefan Beller
2018-08-06 23:18 ` Jonathan Nieder
2018-08-07 15:56 ` Junio C Hamano
2018-08-09 19:26 ` Stefan Beller
2018-08-10 22:18 ` Stefan Beller [this message]
2018-08-09 19:51 ` Stefan Beller
2018-08-10 0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
2018-08-10 0:10 ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller
2018-08-10 0:10 ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180810221857.87399-1-sbeller@google.com \
--to=sbeller@google.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jrnieder@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).