git.vger.kernel.org archive mirror
* git clean performance issues
@ 2015-04-04 18:32 erik elfström
From: erik elfström @ 2015-04-04 18:32 UTC
  To: git

Hi,

I'm having a performance issue with "git clean -qxfd" (note, not using
"-ff").

The performance issue shows up when trying to clean untracked
directories that themselves contain many subdirectories. The
run time grows highly nonlinearly with the number of
subdirectories. Some test numbers:

Dirs    Time
10000   0m0.754s
50000   0m16.606s
100000  1m31.418s

When running "git clean -qxffd" (note, using "-ff") the performance is
fast and linear:

Dirs    Time
10000   0m0.158s
50000   0m0.918s
100000  0m1.639s

After checking the source of git-clean my understanding of the problem
is as follows:

When clean.c:cmd_clean finds a directory and the "-d" flag is given, it
will call clean.c:remove_dirs to potentially remove the directory and
all subdirectories.

Unless "-ff" is given remove_dirs tries to be nice and not remove
directories containing other git repositories. To do this it does the
following check:

    ...
    if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
            !resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
    ...

The problem is that refs.c:resolve_gitlink_ref calls
refs.c:get_ref_cache, which linearly searches a linked list of cache
entries. If it finds no existing entry for the given path, it creates
a new ref_cache entry and inserts it into the list:

    for (refs = submodule_ref_caches; refs; refs = refs->next)
        if (!strcmp(submodule, refs->name))
            return refs;

    refs = create_ref_cache(submodule);
    refs->next = submodule_ref_caches;
    submodule_ref_caches = refs;
    return refs;

In my scenario get_ref_cache will be called 10000+ times, each time
with a new path. The final few calls will need to search through and
compare 10000+ entries before realizing that there is no existing
entry. This quickly adds up to 100 million+ calls to strcmp().
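
To make the growth concrete, here is a minimal standalone sketch that
mimics the lookup-or-prepend pattern quoted above and counts the
strcmp() calls. It is not git code: struct cache_entry,
lookup_or_insert and count_comparisons are made-up names, and the
"dir%lu" paths are only stand-ins for the real subdirectory paths.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a cache entry; the real struct in refs.c holds more. */
struct cache_entry {
    struct cache_entry *next;
    char name[64];
};

static struct cache_entry *cache_head;
static unsigned long long strcmp_calls;

/* Same shape as the quoted loop: linear search, then prepend on a miss. */
static struct cache_entry *lookup_or_insert(const char *name)
{
    struct cache_entry *e;

    assert(strlen(name) < sizeof(e->name));

    for (e = cache_head; e; e = e->next) {
        strcmp_calls++;
        if (!strcmp(name, e->name))
            return e;
    }

    e = calloc(1, sizeof(*e));
    if (!e)
        abort();
    strcpy(e->name, name); /* length checked by the assert above */
    e->next = cache_head;
    cache_head = e;
    return e;
}

/* Empty the list, look up n distinct names, return the strcmp() count. */
static unsigned long long count_comparisons(unsigned long n)
{
    char name[64];
    unsigned long i;

    while (cache_head) {
        struct cache_entry *next = cache_head->next;
        free(cache_head);
        cache_head = next;
    }
    strcmp_calls = 0;

    for (i = 0; i < n; i++) {
        snprintf(name, sizeof(name), "dir%lu", i);
        lookup_or_insert(name);
    }
    return strcmp_calls;
}
```

With n distinct names, lookup number i scans the i entries already in
the list, so the total is n*(n-1)/2 comparisons. In this model
count_comparisons(10000) gives 49,995,000 and
count_comparisons(100000) gives 4,999,950,000, which is consistent
with the roughly quadratic timings above.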

From what I can understand, the calls to get_ref_cache in this
scenario will never do any useful work. Is this correct? If so, would
it be possible to bypass it, maybe by calling
resolve_gitlink_ref_recursive directly or by using some other way of
checking for the presence of a git repo in clean.c:remove_dirs?
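
As one possible shape for "some other way of checking", here is a
rough sketch that only tests whether "<dir>/.git" exists, without
touching any ref cache. This is an assumption about what such a check
could look like, not an existing git function: looks_like_nested_repo
is a hypothetical name, and the test is weaker than what
resolve_gitlink_ref does because it does not verify that HEAD
resolves.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of git.
 * Returns 1 if "<dir>/.git" exists as a directory or a regular file
 * (a gitfile), 0 if it does not, and -1 if the path does not fit in
 * the local buffer. It does not check that the repository is valid.
 */
static int looks_like_nested_repo(const char *dir)
{
    char path[4096];
    struct stat st;
    int len;

    assert(dir != NULL);

    len = snprintf(path, sizeof(path), "%s/.git", dir);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1; /* caller should treat this as "unknown", not "safe to delete" */

    if (stat(path, &st) != 0)
        return 0;

    return (S_ISDIR(st.st_mode) || S_ISREG(st.st_mode)) ? 1 : 0;
}
```

A check like this costs one stat() per directory and keeps no state
between calls, so the total work stays linear in the number of
directories. Whether it is strict enough to protect nested
repositories is exactly the question I am asking.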

/Erik

Thread overview: 6+ messages
2015-04-04 18:32 git clean performance issues erik elfström
2015-04-04 19:55 ` Jeff King
2015-04-04 20:39   ` erik elfström
2015-04-04 20:48     ` Jeff King
2015-11-13 14:19   ` Andreas Krey
2015-11-13 23:53     ` Jeff King