From: "erik elfström" <erik.elfstrom@gmail.com>
To: git@vger.kernel.org
Subject: git clean performance issues
Date: Sat, 4 Apr 2015 20:32:45 +0200
Message-ID: <CAMpP7NY++BwV+UygRj1C6Zsf=jE-z1AQuN3On0HeEqQpKGQtqw@mail.gmail.com>
Hi,
I'm having a performance issue with "git clean -qxfd" (note, not using
"-ff").
The performance issue shows up when trying to clean untracked
directories that themselves contain many subdirectories. The
run time grows highly nonlinearly with the number of
subdirectories. Some test numbers:
  Dirs     Time
  10000    0m0.754s
  50000    0m16.606s
  100000   1m31.418s
When running "git clean -qxffd" (note, using "-ff") the performance is
fast and linear:
  Dirs     Time
  10000    0m0.158s
  50000    0m0.918s
  100000   0m1.639s
After checking the source of git-clean my understanding of the problem
is as follows:
When clean.c:cmd_clean finds a directory and the "-d" flag is given, it
calls clean.c:remove_dirs to potentially remove the directory and
all of its subdirectories.
Unless "-ff" is given, remove_dirs tries to be nice and not remove
directories containing other git repositories. To do this, it performs
the following check:
    ...
    if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
            !resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
    ...
The problem is that refs.c:resolve_gitlink_ref calls
refs.c:get_ref_cache, which linearly searches a linked list of cache
entries. If it fails to find an existing entry for the path it is
given, it creates a new ref_cache entry and inserts it into the list:
    for (refs = submodule_ref_caches; refs; refs = refs->next)
        if (!strcmp(submodule, refs->name))
            return refs;
    refs = create_ref_cache(submodule);
    refs->next = submodule_ref_caches;
    submodule_ref_caches = refs;
    return refs;
In my scenario get_ref_cache will be called 10000+ times, each time
with a new path. The final few calls will need to search through and
compare 10000+ entries before realizing that there is no existing
entry. This quickly adds up to 100 million+ calls to strcmp().
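To make the growth concrete, here is a minimal sketch that mimics the
find-or-prepend pattern and counts the comparisons. The struct
cache_entry type and the count_strcmp_calls function are hypothetical
stand-ins written for this illustration; they are not git code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a ref_cache list node; only the name matters. */
struct cache_entry {
	struct cache_entry *next;
	char name[32];
};

/*
 * Look up n distinct names with the same "search the list, prepend on
 * miss" pattern and return the total number of strcmp() calls made.
 */
unsigned long long count_strcmp_calls(unsigned int n)
{
	struct cache_entry *head = NULL, *e;
	unsigned long long calls = 0;
	unsigned int i;

	for (i = 0; i < n; i++) {
		char name[32];
		int found = 0;

		snprintf(name, sizeof(name), "dir/%u", i);

		/* Linear search: every miss walks the whole list. */
		for (e = head; e; e = e->next) {
			calls++;
			if (!strcmp(name, e->name)) {
				found = 1;
				break;
			}
		}

		/* Miss: create a new entry and prepend it. */
		if (!found) {
			e = malloc(sizeof(*e));
			if (!e)
				abort();
			strcpy(e->name, name);
			e->next = head;
			head = e;
		}
	}

	/* Free the list before returning. */
	while (head) {
		e = head->next;
		free(head);
		head = e;
	}
	return calls;
}
```

Since every name is new, the i-th lookup compares against all i
earlier entries, so n lookups cost n * (n - 1) / 2 comparisons in
total. Calling count_strcmp_calls(10000) therefore returns 49995000,
and multiplying n by 10 multiplies the work by roughly 100.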
From what I can understand, the calls to get_ref_cache in this
scenario will never do any useful work. Is this correct? If so, would
it be possible to bypass it, maybe by calling
resolve_gitlink_ref_recursive directly or by using some other way of
checking for the presence of a git repo in clean.c:remove_dirs?
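As a rough illustration of the "some other way" idea, the sketch
below simply checks whether a ".git" entry exists in a directory. The
has_dot_git helper is hypothetical and not a git function, and I am
assuming that a plain existence check is an acceptable approximation;
a real fix would probably also need to verify that the entry is a
valid repository or gitfile.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of git: return 1 if "<dir>/.git"
 * exists as a directory or a regular file (a gitfile), 0 otherwise.
 * It touches no global state, so the cost per directory is constant.
 */
int has_dot_git(const char *dir)
{
	char path[4096];
	struct stat st;
	int len = snprintf(path, sizeof(path), "%s/.git", dir);

	/* Path did not fit: be conservative and report "maybe a repo". */
	if (len < 0 || (size_t)len >= sizeof(path))
		return 1;

	/* No such entry (or it cannot be examined): not a repo. */
	if (stat(path, &st))
		return 0;

	return S_ISDIR(st.st_mode) || S_ISREG(st.st_mode);
}
```

A caller in the position of remove_dirs could skip a directory when
has_dot_git(path) returns 1, without adding anything to a cache that
grows with the number of directories visited.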
/Erik