* git clean performance issues
From: erik elfström @ 2015-04-04 18:32 UTC
To: git
Hi,
I'm having a performance issue with "git clean -qxfd" (note, not using
"-ff").
The issue shows up when cleaning untracked directories that themselves
contain many subdirectories. The running time is highly nonlinear in
the number of subdirectories. Some test numbers:
  Dirs      Time
  10000     0m0.754s
  50000     0m16.606s
  100000    1m31.418s
When running "git clean -qxffd" (note, using "-ff") the performance is
fast and linear:
  Dirs      Time
  10000     0m0.158s
  50000     0m0.918s
  100000    0m1.639s
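As a rough check on the two sets of timings above, one can compare the measured slowdown with what a linear or a quadratic model would predict. The helper below is only an illustration (the name slowdown_vs_model is made up here and is not part of git); a result near 1.0 means the model fits.

```c
/*
 * Hypothetical helper: ratio of the measured slowdown (t2 / t1) to the
 * slowdown predicted by a model time ~ n^k, for k = 1 (linear) or
 * k = 2 (quadratic). Values near 1.0 mean the model fits the data.
 */
double slowdown_vs_model(double n1, double t1, double n2, double t2, int k)
{
	double size_ratio = n2 / n1;
	double predicted = (k == 2) ? size_ratio * size_ratio : size_ratio;

	return (t2 / t1) / predicted;
}
```

With the numbers above (times converted to seconds), going from 10000 to 100000 directories gives about 1.2 against the quadratic model for "-qxfd" (0.754s to 91.418s) and about 1.04 against the linear model for "-qxffd" (0.158s to 1.639s). So the slow case grows at least quadratically, while the "-ff" case is close to linear.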
After checking the source of git-clean my understanding of the problem
is as follows:
When clean.c:cmd_clean finds a directory and the "-d" flag is given, it
will call clean.c:remove_dirs to potentially remove the directory and
all subdirectories.
Unless "-ff" is given, remove_dirs tries to be nice and not remove
directories containing other git repositories. To do this it does the
following check:
    ...
    if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
        !resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
    ...
The problem is that refs.c:resolve_gitlink_ref calls
refs.c:get_ref_cache, which linearly searches a linked list of cache
entries and, if it fails to find an existing entry, creates and
inserts a new ref_cache entry for the path it was given:
    for (refs = submodule_ref_caches; refs; refs = refs->next)
        if (!strcmp(submodule, refs->name))
            return refs;

    refs = create_ref_cache(submodule);
    refs->next = submodule_ref_caches;
    submodule_ref_caches = refs;
    return refs;
In my scenario get_ref_cache will be called 10000+ times, each time
with a new path. The final few calls will need to search through and
compare 10000+ entries before realizing that there is no existing
entry. This quickly adds up to 100 million+ calls to strcmp().
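To make the growth concrete, here is a small self-contained simulation of the lookup-or-insert pattern described above. It is a sketch, not git's code: struct cache_entry, lookup_or_insert and count_comparisons are hypothetical stand-ins, and the paths are synthetic names like "dir0", "dir1", and so on. Because every path is new, the i-th lookup walks all i existing entries before inserting, so n lookups cost n(n-1)/2 calls to strcmp().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a ref cache entry: only what the lookup needs. */
struct cache_entry {
	struct cache_entry *next;
	char *name;
};

static struct cache_entry *cache_head;
static unsigned long long strcmp_calls;

/* Linear search of the list; on a miss, prepend a new entry. */
static struct cache_entry *lookup_or_insert(const char *name)
{
	struct cache_entry *e;

	for (e = cache_head; e; e = e->next) {
		strcmp_calls++;
		if (!strcmp(name, e->name))
			return e;
	}
	e = malloc(sizeof(*e));
	if (!e)
		abort();
	e->name = malloc(strlen(name) + 1);
	if (!e->name)
		abort();
	strcpy(e->name, name);
	e->next = cache_head;
	cache_head = e;
	return e;
}

/* Look up n distinct synthetic paths; return the number of strcmp() calls. */
unsigned long long count_comparisons(unsigned int n)
{
	char path[32];
	unsigned int i;
	struct cache_entry *e, *next;

	cache_head = NULL;
	strcmp_calls = 0;
	for (i = 0; i < n; i++) {
		snprintf(path, sizeof(path), "dir%u", i);
		lookup_or_insert(path);
	}
	/* Free the list so the function can be called again. */
	for (e = cache_head; e; e = next) {
		next = e->next;
		free(e->name);
		free(e);
	}
	cache_head = NULL;
	return strcmp_calls;
}
```

Under these assumptions count_comparisons(10000) returns 49995000, and 100000 distinct paths would need about 5 billion comparisons, which matches the quadratic shape of the timings.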
From what I can understand, the calls to get_ref_cache in this
scenario will never do any useful work. Is this correct? If so, would
it be possible to bypass it, maybe by calling
resolve_gitlink_ref_recursive directly or by using some other way of
checking for the presence of a git repo in clean.c:remove_dirs?
/Erik
* Re: git clean performance issues
From: Jeff King @ 2015-04-04 19:55 UTC
To: erik elfström; +Cc: git
On Sat, Apr 04, 2015 at 08:32:45PM +0200, erik elfström wrote:
> In my scenario get_ref_cache will be called 10000+ times, each time
> with a new path. The final few calls will need to search through and
> compare 10000+ entries before realizing that there is no existing
> entry. This quickly adds up to 100 million+ calls to strcmp().
>
> From what I can understand, the calls to get_ref_cache in this
> scenario will never do any useful work. Is this correct? If so, would
> it be possible to bypass it, maybe by calling
> resolve_gitlink_ref_recursive directly or by using some other way of
> checking for the presence of a git repo in clean.c:remove_dirs?
I think this is the same issue that was discussed here:
http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265585
There is some discussion of a possible fix in that thread. I was hoping
that Andreas was going to look further and produce a patch, but I
imagine he got busy with other things. Do you want to try picking it up?
-Peff
* Re: git clean performance issues
From: erik elfström @ 2015-04-04 20:39 UTC
To: Jeff King; +Cc: git
That looks like the same issue. The "use is_git_directory" approach
sounds good to me, is that the direction you would prefer? I can try
to cobble something together although I must warn you I have zero
previous experience with this code base so a few iterations will
probably be needed.
/Erik
On Sat, Apr 4, 2015 at 9:55 PM, Jeff King <peff@peff.net> wrote:
> On Sat, Apr 04, 2015 at 08:32:45PM +0200, erik elfström wrote:
>
>> In my scenario get_ref_cache will be called 10000+ times, each time
>> with a new path. The final few calls will need to search through and
>> compare 10000+ entries before realizing that there is no existing
>> entry. This quickly adds up to 100 million+ calls to strcmp().
>>
>> From what I can understand, the calls to get_ref_cache in this
>> scenario will never do any useful work. Is this correct? If so, would
>> it be possible to bypass it, maybe by calling
>> resolve_gitlink_ref_recursive directly or by using some other way of
>> checking for the presence of a git repo in clean.c:remove_dirs?
>
> I think this is the same issue that was discussed here:
>
> http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265585
>
> There is some discussion of a possible fix in that thread. I was hoping
> that Andreas was going to look further and produce a patch, but I
> imagine he got busy with other things. Do you want to try picking it up?
>
> -Peff
* Re: git clean performance issues
From: Jeff King @ 2015-04-04 20:48 UTC
To: erik elfström; +Cc: git
On Sat, Apr 04, 2015 at 10:39:47PM +0200, erik elfström wrote:
> That looks like the same issue. The "use is_git_directory" approach
> sounds good to me, is that the direction you would prefer? I can try
> to cobble something together although I must warn you I have zero
> previous experience with this code base so a few iterations will
> probably be needed.
Yeah, I think the preferred direction is building a solution in
is_git_directory. Multiple iterations are fine. That's what review is
for. :) See SubmittingPatches for tips, though.
-Peff
* Re: git clean performance issues
From: Andreas Krey @ 2015-11-13 14:19 UTC
To: Jeff King; +Cc: erik elfström, git
On Sat, 04 Apr 2015 15:55:07 +0000, Jeff King wrote:
...
> I think this is the same issue that was discussed here:
>
> http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265585
>
> There is some discussion of a possible fix in that thread. I was hoping
> that Andreas was going to look further and produce a patch, but I
> imagine he got busy with other things.
That about sums it up. However I now have a similar issue;
git ls-files shows the same behaviour (takes relatively
forever at 100% CPU), and runs instantly with my patch
from back then. Nothing seems to have changed, so I
may have another chance to look into this.
Andreas
--
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800
* Re: git clean performance issues
From: Jeff King @ 2015-11-13 23:53 UTC
To: Andreas Krey; +Cc: erik elfström, git
On Fri, Nov 13, 2015 at 03:19:07PM +0100, Andreas Krey wrote:
> On Sat, 04 Apr 2015 15:55:07 +0000, Jeff King wrote:
> ...
> > I think this is the same issue that was discussed here:
> >
> > http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265585
> >
> > There is some discussion of a possible fix in that thread. I was hoping
> > that Andreas was going to look further and produce a patch, but I
> > imagine he got busy with other things.
>
> That about sums it up. However I now have a similar issue;
> git ls-files shows the same behaviour (takes relatively
> forever at 100% CPU), and runs instantly with my patch
> from back then. Nothing seems to have changed, so I
> may have another chance to look into this.
Yeah, I think Erik's patch in 0179ca7 (clean: improve performance when
removing lots of directories, 2015-06-15) handles the git-clean case the
way we want to, but all of the other calls to resolve_gitlink_ref need
to be inspected and fixed similarly.
The one you are hitting with ls-files is probably in dir.c:treat_directory.
-Peff