From: Stefan Beller <sbeller@google.com>
To: Mike Hommey <mh@glandium.org>, Jameson Miller <jamill@microsoft.com>
Cc: git <git@vger.kernel.org>
Subject: Re: fast-import slowness when importing large files with small differences
Date: Fri, 29 Jun 2018 13:14:52 -0700
Message-ID: <CAGZ79kb0FOafEsuXU7c_BTwPtcujFeyWVhzSuzFHRFtQHp9weQ@mail.gmail.com>
In-Reply-To: <20180629094413.bgltep6ntlza6vhz@glandium.org>
On Fri, Jun 29, 2018 at 3:18 AM Mike Hommey <mh@glandium.org> wrote:
>
> Hi,
>
> I noticed some slowness when fast-importing data from the Firefox mercurial
> repository, where fast-import spends more than 5 minutes importing ~2000
> revisions of one particular file. I reduced a testcase while still
> using real data. One could synthesize data with kind of the same
> properties, but I figured real data could be useful.
I cc'd Jameson, who recently refactored memory allocation in fast-import.
(I am not aware of any other recent refactoring in the area of fast-import.)
> To reproduce:
[...]
> Memory total: 2282 KiB
> pools: 2048 KiB
> objects: 234 KiB
>
[...]
> Obviously, sha1'ing 26GB is not going to be free, but it's also not the
> dominating cost, according to perf:
>
> 63.52% git-fast-import git-fast-import [.] create_delta_index
So this sounds like a diffing/deltaing issue rather than a memory issue.
> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.
This sounds interesting. I'd love to see that code unified.
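To make the "trim common head and tail" idea concrete, here is a minimal sketch in Python. It is not the code from diff-delta.c, xdiff, or Mercurial; the function names trim_common and trimmed_middles are made up for illustration, and it works at byte granularity, whereas a line-based diff would presumably need to round the trimmed regions to line boundaries.

```python
from typing import Tuple


def trim_common(a: bytes, b: bytes) -> Tuple[int, int]:
    """Return (prefix_len, suffix_len) shared by a and b.

    Hypothetical helper for illustration only. The suffix is never
    allowed to overlap the prefix, so identical inputs yield a full
    prefix and an empty suffix.
    """
    limit = min(len(a), len(b))

    # Count identical bytes from the start.
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1

    # Count identical bytes from the end, stopping before the prefix.
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1

    return prefix, suffix


def trimmed_middles(a: bytes, b: bytes) -> Tuple[bytes, bytes]:
    """Return only the differing middle parts of a and b."""
    prefix, suffix = trim_common(a, b)
    return a[prefix:len(a) - suffix], b[prefix:len(b) - suffix]


if __name__ == "__main__":
    old = b"header\nold line\nfooter\n"
    new = b"header\nnew line\nfooter\n"
    print(trim_common(old, new))
    print(trimmed_middles(old, new))
```

For large files with small differences, only the short middle parts would then have to be handed to the expensive indexing/diffing step.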
> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be worth
> looking at what's worth taking from there.
So starting with
https://www.mercurial-scm.org/repo/hg/rev/34e2ff1f9cd8
("xdiff: vendor xdiff library from git")
they adapted it slightly:
$ hg log --template '{node|short} {desc|firstline}\n' -- mercurial/thirdparty/xdiff/
a2baa61bbb14 xdiff: move stdint.h to xdiff.h
d40b9e29c114 xdiff: fix a hard crash on Windows
651c80720eed xdiff: silence a 32-bit shift warning on Windows
d255744de97a xdiff: backport int64_t and uint64_t types to Windows
e5b14f5b8b94 xdiff: resolve signed unsigned comparison warning
f1ef0e53e628 xdiff: use int64 for hash table size
f0d9811dda8e xdiff: remove unused xpp and xecfg parameters
49fe6249937a xdiff: remove unused flags parameter
882657a9f768 xdiff: replace {unsigned ,}long with {u,}int64_t
0c7350656f93 xdiff: add comments for fields in xdfile_t
f33a87cf60cc xdiff: add a preprocessing step that trims files
3cf40112efb7 xdiff: remove xmerge related logic
90f8fe72446c xdiff: remove xemit related logic
b5bb0f99064d xdiff: remove unused structure, functions, and constants
09f320067591 xdiff: remove whitespace related feature
1f9bbd1d6b8a xdiff: fix builds on Windows
c420792217c8 xdiff: reduce indent heuristic overhead
b3c9c483cac9 xdiff: add a bdiff hunk mode
9e7b14caf67f xdiff: remove patience and histogram diff algorithms
34e2ff1f9cd8 xdiff: vendor xdiff library from git
Interesting pieces regarding performance:
c420792217c8 xdiff: reduce indent heuristic overhead
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
f33a87cf60cc xdiff: add a preprocessing step that trims files
https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
I'll see if I can make that into patches.
Thanks,
Stefan