From: Stefan Beller <sbeller@google.com>
To: sbeller@google.com
Cc: mhagger@alum.mit.edu, git@vger.kernel.org, jamill@microsoft.com,
mh@glandium.org
Subject: [PATCH] xdiff: reduce indent heuristic overhead
Date: Fri, 29 Jun 2018 13:28:11 -0700
Message-ID: <20180629202811.131265-1-sbeller@google.com>
In-Reply-To: <CAGZ79kb0FOafEsuXU7c_BTwPtcujFeyWVhzSuzFHRFtQHp9weQ@mail.gmail.com>
This patch was originally written for Mercurial at
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
changeset: 36674:c420792217c8
user: Jun Wu <quark@fb.com>
date: Sat Mar 03 12:39:11 2018 -0800
files: mercurial/thirdparty/xdiff/xdiffi.c
description:
xdiff: reduce indent heuristic overhead
Add a threshold to avoid expensive cases such as:
```
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
```
The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
makes diff much slower.
Before this patch (system git 2.14.2):
```
git diff --no-indent-heuristic a b 0.21s user 0.03s system 100% cpu 0.239 total
git diff --indent-heuristic a b 0.77s user 0.02s system 99% cpu 0.785 total
```
After this patch (git 2fc74f41, with xdiffi.c patched):
```
# with the changed xdiffi.c
git diff --indent-heuristic a b 0.16s user 0.01s system 90% cpu 0.188 total
git diff --no-indent-heuristic a b 0.18s user 0.01s system 99% cpu 0.192 total
```
Now turning on indent-heuristic has no visible impact on performance.
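The timings above could be reproduced with something like the following Python sketch. The helper names `write_inputs` and `time_diff` are made up for illustration and are not part of git or Mercurial; `--no-index` is spelled out on the assumption that the two files are compared outside a repository.

```python
import subprocess
import time


def write_inputs(n, path_a="a", path_b="b"):
    """Write n and n + 1 whitespace-only lines, mirroring the example above."""
    with open(path_a, "w") as f:
        f.write(" \n" * n)
    with open(path_b, "w") as f:
        f.write(" \n" * (n + 1))


def time_diff(flag, path_a="a", path_b="b"):
    """Return wall-clock seconds for one `git diff --no-index` run with `flag`."""
    start = time.perf_counter()
    # Exit status 1 only means "the files differ", so it is not treated as an error.
    subprocess.run(
        ["git", "diff", "--no-index", flag, path_a, path_b],
        stdout=subprocess.DEVNULL,
        check=False,
    )
    return time.perf_counter() - start
```

Calling `write_inputs(1000000)` once and then `time_diff("--no-indent-heuristic")` and `time_diff("--indent-heuristic")` gives wall-clock numbers comparable in spirit to the ones quoted above; absolute values will depend on the machine and git version.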
Differential Revision: https://phab.mercurial-scm.org/D2601
Signed-off-by: Stefan Beller <sbeller@google.com>
---
This applies on our master branch. I have not yet thought of a
good commit message, or about whether we need to test it.
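
As a starting point for thinking about a test, here is a rough Python model of the new loop. It is only a sketch: `pick_shift` is a hypothetical name, and the plain list `scores` stands in for the per-shift values that xdiffi.c derives via measure_split() and score_add_split(), with the assumption that a lower number means a better split.

```python
MAX_BORING = 100  # same threshold as in the patch


def pick_shift(scores, max_boring=MAX_BORING):
    """Scan candidates bottom to top; return (best_index, candidates_evaluated).

    `scores` is a stand-in for the real split scores; lower is better here.
    """
    best = None
    boring = 0
    evaluated = 0
    for i in range(len(scores) - 1, -1, -1):
        evaluated += 1
        # Strict comparison: on a tie the earlier (bottommost) pick is kept.
        if best is None or scores[i] < scores[best]:
            best = i
            boring = 0
        else:
            boring += 1
            if boring >= max_boring:
                break  # give up after max_boring candidates without improvement
    return best, evaluated
```

With a million equal scores the model stops after 101 evaluations and still returns the bottommost index. The flip side is that a strictly better position more than MAX_BORING candidates above the last improvement is never looked at, for example `pick_shift([0] + [1] * 200)` keeps index 200 rather than 0.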
Thanks,
Stefan
xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BLANKS) (N = shift-able lines).
+			 * Even with MAX_BLANKS bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
--
2.18.0.399.gad0ab374a1-goog