Strange O(N^3) behavior in "git filter-branch"

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Strange O(N^3) behavior in "git filter-branch"
@ 2011-07-14  7:16 Michael Haggerty
  2011-07-14  9:24 ` Michael Haggerty
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Haggerty @ 2011-07-14  7:16 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4782 bytes --]

I have noticed that "git filter-branch" gets pathologically slow when it
operates on a repository that has many references in a complicated
directory hierarchy.  The time seems to go like O(N^3), where N is the
number of references being rewritten.

I have attached a Python program that I have been using to explore this
behavior.  It creates a simple linear history with hardly any content
but one reference to each commit, then runs "git-filter-branch" on all
branches using a trivial tree-filter.

The times were taken using the "maint" version of git on a notebook
computer with 4Gb of RAM running Linux 2.6.32.  The RAM usage never
seemed to expand unusually.

The part that slows down is the part near the end where it writes

    Ref 'foo' was rewritten
    Ref 'bar' was rewritten
    etc.

These lines come at ever-greater intervals and dominate the time for
running git-filter-branch for N more than 1000 or so.  The time taken
for each iteration of this loop increases quadratically, so the time for
the whole run goes like N^3.  The time that the Nth line appears goes
approximately like

    254.954 + 0.0195578*N + 3.38276e-05*N^2 + 3.45833e-08*N^3

; and the N^3 term starts to dominate around N=1000.  The 1000th step
takes about 270 ms.

To figure out where the time is going, I instrumented git-filter-branch
as shown below, and found the blocks between "1" and "2", "4b" and "5",
and "5" and 6" each take about 90 ms (and therefore almost all of the time).

The slowdown seems to depend crucially on having a complicated hierarchy
of references.  The bad case is lots of subdirectories, like

    refs/heads/a0/b0/c0000
    refs/heads/a1/b0/c0001
    refs/heads/a2/b0/c0002
    ...
    refs/heads/a9/b9/c0999

, where I have basically sharded the references based on their last two
digits.  (It is just as slow if the sharding is based on the most
significant digits.)

On the other hand, if the refs are all in one directory (even a
subdirectory) like

    refs/heads/a/b/c0000
    refs/heads/a/b/c0001
    ...
    refs/heads/a/b/c0999

, then the times go down dramatically, and appear to go from O(N^3) to
O(N^2) (though O(N^2) still seems too slow for this loop!).

If the "git pack-refs" is omitted before "git filter-branch", then the
times become considerably worse, though I have not done systematic tests
on this variation.

I tried to reproduce the slowdown in a simpler scenario using various
loops over "git update-ref", but have so far been unsuccessful.
However, I have also observed a serious slowdown in the context of
making lots of automated fixes to a git repository converted from
Subversion (involving lots of calls to git-update-ref).  The slowdown in
this case could be largely mitigated by running "git pack-refs"
periodically, so it seems that the slowdown is related to the number of
unpacked refs in complicated directory structures.  Unfortunately it
would not be trivial to insert "git pack-refs" in the "git
filter-branch" loop because the loop relies on the presence of unpacked
references to "avoid rewriting a ref twice".

Does anybody have an idea?

Michael

Instrumented git-filter-branch:
> while read ref
> do
> echo 0 @@@
> 	# avoid rewriting a ref twice
> 	test -f "$orig_namespace$ref" && continue
> 
> echo 1 @@@
> 	sha1=$(git rev-parse "$ref"^0)
> echo 2 @@@
> 	rewritten=$(map $sha1)
> 
> echo 3 @@@
> 	test $sha1 = "$rewritten" &&
> 		warn "WARNING: Ref '$ref' is unchanged" &&
> 		continue
> 
> echo 4 @@@
> 	case "$rewritten" in
> 	'')
> echo 4a @@@
> 		echo "Ref '$ref' was deleted"
> 		git update-ref -m "filter-branch: delete" -d "$ref" $sha1 ||
> 			die "Could not delete $ref"
> 	;;
> 	$_x40)
> echo 4b @@@
> 		echo "Ref '$ref' was rewritten"
> 		if ! git update-ref -m "filter-branch: rewrite" \
> 					"$ref" $rewritten $sha1 2>/dev/null; then
> 			if test $(git cat-file -t "$ref") = tag; then
> 				if test -z "$filter_tag_name"; then
> 					warn "WARNING: You said to rewrite tagged commits, but not the corresponding tag."
> 					warn "WARNING: Perhaps use '--tag-name-filter cat' to rewrite the tag."
> 				fi
> 			else
> 				die "Could not rewrite $ref"
> 			fi
> 		fi
> 	;;
> 	*)
> echo 4c @@@
> 		# NEEDSWORK: possibly add -Werror, making this an error
> 		warn "WARNING: '$ref' was rewritten into multiple commits:"
> 		warn "$rewritten"
> 		warn "WARNING: Ref '$ref' points to the first one now."
> 		rewritten=$(echo "$rewritten" | head -n 1)
> 		git update-ref -m "filter-branch: rewrite to first" \
> 				"$ref" $rewritten $sha1 ||
> 			die "Could not rewrite $ref"
> 	;;
> 	esac
> echo 5 @@@
> 	git update-ref -m "filter-branch: backup" "$orig_namespace$ref" $sha1 ||
> 		 exit
> echo 6 @@@
> done < "$tempdir"/heads

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

[-- Attachment #2: filter-time --]
[-- Type: text/plain, Size: 2446 bytes --]

#! /usr/bin/python

import sys
import os
import subprocess
import shutil
import time

GIT_REPO = '/tmp/git-time-test'
FILENAME1 = os.path.join(GIT_REPO, 'a.txt')
FILENAME2 = os.path.join(GIT_REPO, 'b.txt')

N = 3000

def run(*args, **kw):
    if True:
        sys.stderr.write('Running %s\n' % (repr(args),))
    subprocess.check_call(*args, cwd=GIT_REPO, **kw)

def get_refnames(*patterns):
    cmd = ['git', 'for-each-ref', '--format=%(refname)'] + list(patterns)
    if True:
        sys.stderr.write('Running %s\n' % (repr(cmd),))
    p = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        cwd=GIT_REPO,
        )
    (out, err) = p.communicate()
    return [l.strip() for l in out.splitlines()]

def get_name(i):
    name = '%04d' % (i,)
    #return 'refs/heads/b%s' % (name,)
    return 'refs/heads/a/b/c%s' % (name,)
    #return 'refs/heads/a%s/b%s/c%s' % (name[-1], name[-2], name,)

def make_repo():
    if os.path.exists(GIT_REPO):
        shutil.rmtree(GIT_REPO)

    subprocess.check_call(['git', 'init', GIT_REPO])

    run(['git', 'config', 'user.name', 'Lou User'])
    run(['git', 'config', 'user.email', 'luser@example.com'])

    open(FILENAME1, 'w')
    run(['git', 'add', '-N', FILENAME1])

    for i in range(N):
        open(FILENAME1, 'w').write('%d\n' % (i,))
        run(['git', 'commit', '-a', '-m', 'Commit %d' % (i,)])
        #run(['git', 'update-ref', 'refs/tags/%04d' % (i,), 'HEAD'])
        run(['git', 'update-ref', get_name(i), 'HEAD'])

def pack_refs():
    run(['git', 'pack-refs', '--all', '--prune'])

def filter():
    run(['git', 'filter-branch', '--tree-filter', 'cp a.txt b.txt', '--', '--all'])

def add_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(i)
        run(['git', 'update-ref', '-m', 'add ref', new_refname, refname])

def move_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(N - 1 - i)
        run(['git', 'update-ref', '-m', 'move ref', new_refname, refname])

def main(args):
    make_repo()
    pack_refs()
    t = time.time()
    filter()
    #add_refs()
    #move_refs()
    print 'time to process: %.3f s' % (time.time() - t,)

main(sys.argv[1:])

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-14  7:16 Strange O(N^3) behavior in "git filter-branch" Michael Haggerty
@ 2011-07-14  9:24 ` Michael Haggerty
  2011-07-15  9:19   ` Michael Haggerty
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Haggerty @ 2011-07-14  9:24 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1161 bytes --]

On 07/14/2011 09:16 AM, Michael Haggerty wrote:
> I have noticed that "git filter-branch" gets pathologically slow when it
> operates on a repository that has many references in a complicated
> directory hierarchy.  The time seems to go like O(N^3), where N is the
> number of references being rewritten.

Correction and new information...

Correction: The test script that I attached to the last email was
configured incorrectly.  [Note to self: thunderbird reads attached files
at the moment the email is *sent*, not the moment that the attachment is
added in the compose window.]  The corrected script is attached.

New information: Once I have "primed" a git repository using the
attached script, a command like the following takes about 90 ms:

    git rev-parse refs/heads/a4/b1/c0414^0

strace reveals that it is calling stat64() then lstat64() on every
directory and every file under .git/refs.  By contrast,

    git rev-parse refs/heads/a4/b1/c0414

goes more or less straight to the file
.git/refs/tags/refs/heads/a4/b1/c0414, and finishes in a few milliseconds.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

[-- Attachment #2: filter-time --]
[-- Type: text/plain, Size: 2848 bytes --]

#! /usr/bin/python

import sys
import os
import subprocess
import shutil
import time


GIT_REPO = '/tmp/git-time-test'
FILENAME1 = os.path.join(GIT_REPO, 'a.txt')
FILENAME2 = os.path.join(GIT_REPO, 'b.txt')


N = 1000

#REF_STYLE = 'flat'
#REF_STYLE = 'subdir'
REF_STYLE = 'shard'

ACTION = 'filter'
#ACTION = 'add_refs'
#ACTION = 'move_refs'

PACK_REFS = True


def run(*args, **kw):
    if True:
        sys.stderr.write('Running %s\n' % (repr(args),))
    subprocess.check_call(*args, cwd=GIT_REPO, **kw)


def get_refnames(*patterns):
    cmd = ['git', 'for-each-ref', '--format=%(refname)'] + list(patterns)
    if True:
        sys.stderr.write('Running %s\n' % (repr(cmd),))
    p = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        cwd=GIT_REPO,
        )
    (out, err) = p.communicate()
    return [l.strip() for l in out.splitlines()]


def get_name(i):
    name = '%04d' % (i,)
    if REF_STYLE == 'flat':
        return 'refs/heads/b%s' % (name,)
    elif REF_STYLE == 'subdir':
        return 'refs/heads/a/b/c%s' % (name,)
    elif REF_STYLE == 'shard':
        return 'refs/heads/a%s/b%s/c%s' % (name[-1], name[-2], name,)


def make_repo():
    if os.path.exists(GIT_REPO):
        shutil.rmtree(GIT_REPO)

    subprocess.check_call(['git', 'init', GIT_REPO])

    run(['git', 'config', 'user.name', 'Lou User'])
    run(['git', 'config', 'user.email', 'luser@example.com'])

    open(FILENAME1, 'w')
    run(['git', 'add', '-N', FILENAME1])

    for i in range(N):
        open(FILENAME1, 'w').write('%d\n' % (i,))
        run(['git', 'commit', '-a', '-m', 'Commit %d' % (i,)])
        if ACTION != 'filter':
            run(['git', 'update-ref', 'refs/tags/%04d' % (i,), 'HEAD'])
        run(['git', 'update-ref', get_name(i), 'HEAD'])


def pack_refs():
    run(['git', 'pack-refs', '--all', '--prune'])


def filter():
    run(['git', 'filter-branch', '--tree-filter', 'cp a.txt b.txt', '--', '--all'])


def add_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(i)
        run(['git', 'update-ref', '-m', 'add ref', new_refname, refname])


def move_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(N - 1 - i)
        run(['git', 'update-ref', '-m', 'move ref', new_refname, refname])


def main(args):
    make_repo()

    if PACK_REFS:
        pack_refs()

    t = time.time()
    if ACTION == 'filter':
        filter()
    elif ACTION == 'add_refs':
        add_refs()
    elif ACTION == 'move_refs':
        move_refs()
    print 'time to process: %.3f s' % (time.time() - t,)


main(sys.argv[1:])


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-14  9:24 ` Michael Haggerty
@ 2011-07-15  9:19   ` Michael Haggerty
  2011-07-15 18:51     ` Junio C Hamano
  2011-08-03 13:33     ` Michael Haggerty
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Haggerty @ 2011-07-15  9:19 UTC (permalink / raw)
  To: git

On 07/14/2011 11:24 AM, Michael Haggerty wrote:
> On 07/14/2011 09:16 AM, Michael Haggerty wrote:
>> I have noticed that "git filter-branch" gets pathologically slow when it
>> operates on a repository that has many references in a complicated
>> directory hierarchy.  The time seems to go like O(N^3), where N is the
>> number of references being rewritten.
> 
> New information: Once I have "primed" a git repository using the
> attached script, a command like the following takes about 90 ms:
> 
>     git rev-parse refs/heads/a4/b1/c0414^0
> 
> strace reveals that it is calling stat64() then lstat64() on every
> directory and every file under .git/refs.

It looks like the time is being taken up by attempted object
replacement.  The --no-replace-objects option makes the lookup fast again:

$ time git rev-parse
refs/heads/a4/b1/c0414^0c9d42cb64d465f354c1fa6eea5825ebb6d0483a7

real	0m0.093s
user	0m0.064s
sys	0m0.032s
$ time git --no-replace-objects rev-parse refs/heads/a4/b1/c0414^0
c9d42cb64d465f354c1fa6eea5825ebb6d0483a7

real	0m0.007s
user	0m0.000s
sys	0m0.012s

The guilty code path is here:

> #0  get_ref_dir (submodule=0x0, base=0x81384dd "refs", list=0x0) at refs.c:260
> #1  0x080f8a63 in get_loose_refs (submodule=0x0) at refs.c:368
> #2  0x080f912f in do_for_each_ref (submodule=0x0, base=<value optimized out>, fn=<value optimized out>, trim=13, flags=0, cb_data=0x0) at refs.c:663
> #3  0x080f92db in for_each_replace_ref (fn=0x80fd1c0 <register_replace_ref>, cb_data=0x0) at refs.c:782
> #4  0x080fd174 in prepare_replace_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at replace_object.c:86
> #5  do_lookup_replace_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at replace_object.c:103
> #6  0x080e7920 in lookup_replace_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at cache.h:774
> #7  parse_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at object.c:191
> #8  0x080bcf2a in lookup_commit_reference_gently (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247", quiet=0) at commit.c:30
> #9  0x080bcf89 in lookup_commit_reference (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at commit.c:39
> #10 0x0810e2a8 in get_parent (name=<value optimized out>, len=<value optimized out>, sha1=<value optimized out>) at sha1_name.c:456
> #11 get_sha1_1 (name=<value optimized out>, len=<value optimized out>, sha1=<value optimized out>) at sha1_name.c:663
> #12 0x0810e9dd in get_sha1_with_context_1 (name=0xbffff48c "refs/heads/a4/b1/c0414^0", sha1=0xbffff098 "\310\360\377\277\001C\021\b\300\265䷂\364\377\277", oc=0xbfffefec, only_to_die=0, prefix=0x0) at sha1_name.c:1123
> #13 0x0810f290 in get_sha1_with_context (name=0xbffff48c "refs/heads/a4/b1/c0414^0", sha1=0xbffff098 "\310\360\377\277\001C\021\b\300\265䷂\364\377\277") at cache.h:824
> #14 get_sha1 (name=0xbffff48c "refs/heads/a4/b1/c0414^0", sha1=0xbffff098 "\310\360\377\277\001C\021\b\300\265䷂\364\377\277") at sha1_name.c:987
> #15 0x080a258a in cmd_rev_parse (argc=2, argv=0xbffff2a4, prefix=0x0) at builtin/rev-parse.c:715
> #16 0x0804b917 in run_builtin (argc=<value optimized out>, argv=<value optimized out>) at git.c:302
> #17 handle_internal_command (argc=<value optimized out>, argv=<value optimized out>) at git.c:460
> #18 0x0804c00a in main (argc=2, argv=0xbffff2a4) at git.c:545

The replace_object machinery is causing the whole loose reference cache
to be populated even though it is only interested in references under
refs/replace (which, in this case, doesn't even exist).

A many possible improvements come to mind, in increasing order of
intrusiveness and generality:

0. Individual users can use --no-replace-objects or
GIT_NO_REPLACE_OBJECTS to reduce their pain, especially when running
git-filter-branch.  But this is cumbersome and error-prone.

1. Change git-filter-branch (and any other long-running commands?) to do
an initial check for the presence of replace references (packed or
loose), and if there are none, set GIT_NO_REPLACE_OBJECTS automatically.
 This would of course fail if any of the user's scripts try to set up
replace references.  (Side note: perhaps the git-replace command should
complain if GIT_NO_REPLACE_OBJECTS is turned on?  It would almost always
indicate a mistake.)  It also wouldn't help in repositories that *have*
replace references.

2. Add an option to git-filter-branch to have it pack references
occasionally.  This could even be made the default behavior if nobody
objects to having their references packed without their conscious decision.

3. Optimize the specific case where there is no refs/replace
directory--if this directory is missing, then defer populating the loose
refs cache in the hope that it will never be needed.

4. Optimize reading of loose refs in general--change get_loose_refs() to
take a "base" parameter, and defer populating the loose refs cache if
the directory corresponding to "base" is absent.  This would salvage the
worst cases, but would always force the whole loose refs cache to be
read even in cases when only part of it is needed.

5. Organize the loose refs cache in memory as a tree, and only populate
the parts of it that are accessed.  This should also speed up iteration
through a subtree by avoiding a linear search through all loose references.

Unfortunately, none of these improvements would change the fact that the
packed refs always have to be read in full, and thus that *basically
all* git commands scale like O(total number of references).  But packed
refs are so much faster than loose refs that this is probably acceptable.

I'd be willing to work on this problem if I get a little bit of feedback
from more experienced git developers about which approach would be
acceptable / preferred.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-15  9:19   ` Michael Haggerty
@ 2011-07-15 18:51     ` Junio C Hamano
  2011-07-15 21:20       ` Jeff King
  2011-08-03 13:33     ` Michael Haggerty
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2011-07-15 18:51 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> 1. Change git-filter-branch (and any other long-running commands?) to do
> an initial check for the presence of replace references (packed or
> loose), and if there are none, set GIT_NO_REPLACE_OBJECTS automatically.
>  This would of course fail if any of the user's scripts try to set up
> replace references.  (Side note: perhaps the git-replace command should
> complain if GIT_NO_REPLACE_OBJECTS is turned on?  It would almost always
> indicate a mistake.)  It also wouldn't help in repositories that *have*
> replace references.

In the short term I think this makes sense, as the whole point of using
filter-branch in a repository that has grafts and replacements is so that
the resulting history won't have to look-aside into grafts and replace
information.

But I think the replace-object codepath should be optimized to realize
there is no funky replacement (which _is_ a rare configuration) going on
much early so that it does not incur that much overhead you observed. IOW,
I tend to agree with your 3. below.

> 2. Add an option to git-filter-branch to have it pack references
> occasionally.

Doesn't the code already do this via "git gc" though?

> 3. Optimize the specific case where there is no refs/replace
> directory--if this directory is missing, then defer populating the loose
> refs cache in the hope that it will never be needed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-15 18:51     ` Junio C Hamano
@ 2011-07-15 21:20       ` Jeff King
  2011-07-16  5:26         ` Michael Haggerty
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2011-07-15 21:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, git

On Fri, Jul 15, 2011 at 11:51:49AM -0700, Junio C Hamano wrote:

> But I think the replace-object codepath should be optimized to realize
> there is no funky replacement (which _is_ a rare configuration) going on
> much early so that it does not incur that much overhead you observed. IOW,
> I tend to agree with your 3. below.
> [...]
> > 3. Optimize the specific case where there is no refs/replace
> > directory--if this directory is missing, then defer populating the loose
> > refs cache in the hope that it will never be needed.

It already tries to do so. It looks like it calls:

  for_each_replace_ref(...)

which calls:

  do_for_each_ref("refs/replace", ...)

which reads _every_ loose ref, regardless of the prefix we have given
it. So the optimization should go into the for_each_ref code, which
should avoid looking at parts of the hierarchy that are just going to be
culled, no?

And then we would see immediately that there is no refs/replace at all,
and quit early. And as a bonus, things like "for_each_tag_ref" would get
faster in a repository with many branches, too.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-15 21:20       ` Jeff King
@ 2011-07-16  5:26         ` Michael Haggerty
  2011-07-17 13:24           ` Drew Northup
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Haggerty @ 2011-07-16  5:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On 07/15/2011 11:20 PM, Jeff King wrote:
> On Fri, Jul 15, 2011 at 11:51:49AM -0700, Junio C Hamano wrote:
>> But I think the replace-object codepath should be optimized to realize
>> there is no funky replacement (which _is_ a rare configuration) going on
>> much early so that it does not incur that much overhead you observed. IOW,
>> I tend to agree with your 3. below.
>> [...]
>>> 3. Optimize the specific case where there is no refs/replace
>>> directory--if this directory is missing, then defer populating the loose
>>> refs cache in the hope that it will never be needed.
> 
> It already tries to do so. It looks like it calls:
> 
>   for_each_replace_ref(...)
> 
> which calls:
> 
>   do_for_each_ref("refs/replace", ...)
> 
> which reads _every_ loose ref, regardless of the prefix we have given
> it. So the optimization should go into the for_each_ref code, which
> should avoid looking at parts of the hierarchy that are just going to be
> culled, no?

The way to implement the deferral of reading refs if for_each_ref() is
called on an empty subtree (alternative 4 or 5 from my earlier email) is
to change get_loose_refs() to take a base argument, have it check
whether the directory corresponding to base is empty, and if so return
NULL without reading anything into the cache.

Currently, the loose ref cache is stored as a single linked list, so
there is no easy way to populate part of it now and part of it later.
So with the current data structure, the loose refs cache is
all-or-nothing.  It would be possible to avoid filling it if there are
not replace references, but if there is even one loose replace reference
then the whole refs tree would have to be crawled.  Implementing this
variation is alternative 4 from the early email.

More flexible would be to change the way the loose ref cache is stored
from a linked list into a tree (probably mirroring the directory tree).
 If this were done, then it would be possible to populate the cache
lazily, only crawling the part of the refs tree that is needed for a
particular call of for_each_ref() and reusing any part of the cache that
is already in memory.  This (alternative "5") is considerably more
involved because the data structure has to be changed, but also
potentially a much bigger win because the presence of a single reference
in refs/replace would not force all of the refs/heads and refs/tags and
... to be read.

> And then we would see immediately that there is no refs/replace at all,
> and quit early. And as a bonus, things like "for_each_tag_ref" would get
> faster in a repository with many branches, too.

Only if the data structure used to hold the loose refs cache is changed...

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-16  5:26         ` Michael Haggerty
@ 2011-07-17 13:24           ` Drew Northup
  2011-07-18  8:59             ` Jakub Narebski
  0 siblings, 1 reply; 11+ messages in thread
From: Drew Northup @ 2011-07-17 13:24 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Junio C Hamano, git


On Sat, 2011-07-16 at 07:26 +0200, Michael Haggerty wrote:

> Currently, the loose ref cache is stored as a single linked list, so
> there is no easy way to populate part of it now and part of it later.
> So with the current data structure, the loose refs cache is
> all-or-nothing.  It would be possible to avoid filling it if there are
> not replace references, but if there is even one loose replace reference
> then the whole refs tree would have to be crawled.  Implementing this
> variation is alternative 4 from the early email.
> 
> More flexible would be to change the way the loose ref cache is stored
> from a linked list into a tree (probably mirroring the directory tree).

Given the potential for high performance inherent with trees, why mix
metaphors like this? What would the gain be?

>  If this were done, then it would be possible to populate the cache
> lazily, only crawling the part of the refs tree that is needed for a
> particular call of for_each_ref() and reusing any part of the cache that
> is already in memory.  

Is this the argument for directory structure mirroring?

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-17 13:24           ` Drew Northup
@ 2011-07-18  8:59             ` Jakub Narebski
  2011-07-18 16:01               ` Drew Northup
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Narebski @ 2011-07-18  8:59 UTC (permalink / raw)
  To: Drew Northup; +Cc: Michael Haggerty, Jeff King, Junio C Hamano, git

Drew Northup <drew.northup@maine.edu> writes:

> On Sat, 2011-07-16 at 07:26 +0200, Michael Haggerty wrote:
> 
> > Currently, the loose ref cache is stored as a single linked list, so
> > there is no easy way to populate part of it now and part of it later.
> > So with the current data structure, the loose refs cache is
> > all-or-nothing.  It would be possible to avoid filling it if there are
> > not replace references, but if there is even one loose replace reference
> > then the whole refs tree would have to be crawled.  Implementing this
> > variation is alternative 4 from the early email.
> > 
> > More flexible would be to change the way the loose ref cache is stored
> > from a linked list into a tree (probably mirroring the directory tree).
> 
> Given the potential for high performance inherent with trees, why mix
> metaphors like this? What would the gain be?

Did you mean: "why linked list"?  I _guess_ that it is most probably
because linked list is simpler and better known data structure than
non-binary tree.

What is needed I think is something like trie[1], but with path
components and not letters stored in trie nodes.

[1]: http://en.wikipedia.org/wiki/Trie
 
> >  If this were done, then it would be possible to populate the cache
> > lazily, only crawling the part of the refs tree that is needed for a
> > particular call of for_each_ref() and reusing any part of the cache that
> > is already in memory.  
> 
> Is this the argument for directory structure mirroring?

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-18  8:59             ` Jakub Narebski
@ 2011-07-18 16:01               ` Drew Northup
  0 siblings, 0 replies; 11+ messages in thread
From: Drew Northup @ 2011-07-18 16:01 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Michael Haggerty, Jeff King, Junio C Hamano, git


On Mon, 2011-07-18 at 01:59 -0700, Jakub Narebski wrote:
> Drew Northup <drew.northup@maine.edu> writes:
> 
> > On Sat, 2011-07-16 at 07:26 +0200, Michael Haggerty wrote:
> > 
> > > Currently, the loose ref cache is stored as a single linked list, so
> > > there is no easy way to populate part of it now and part of it later.
> > > So with the current data structure, the loose refs cache is
> > > all-or-nothing.  It would be possible to avoid filling it if there are
> > > not replace references, but if there is even one loose replace reference
> > > then the whole refs tree would have to be crawled.  Implementing this
> > > variation is alternative 4 from the early email.
> > > 
> > > More flexible would be to change the way the loose ref cache is stored
> > > from a linked list into a tree (probably mirroring the directory tree).
> > 
> > Given the potential for high performance inherent with trees, why mix
> > metaphors like this? What would the gain be?
> 
> Did you mean: "why linked list"?  I _guess_ that it is most probably
> because linked list is simpler and better known data structure than
> non-binary tree.

No, that I can compute. I was asking why mix tree metaphors (pure
binary, R/B, and 234 being probably the most common kinds for the data
structure; and filesystem "trees"). In my mind I was thinking of
SHA1sums as the keys (for some reason that doesn't occur to me right
now) and thought perhaps it was worth becoming enlightened (or
something). Perhaps I should have looked harder in my mail queue for the
patch referenced.

> 
> What is needed I think is something like trie[1], but with path
> components and not letters stored in trie nodes.
> 
> [1]: http://en.wikipedia.org/wiki/Trie

Obviously, there are other "tree" structures. That's one I probably
should have thought of earlier.

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-07-15  9:19   ` Michael Haggerty
  2011-07-15 18:51     ` Junio C Hamano
@ 2011-08-03 13:33     ` Michael Haggerty
  2011-08-03 19:37       ` Jeff King
  1 sibling, 1 reply; 11+ messages in thread
From: Michael Haggerty @ 2011-08-03 13:33 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Drew Northup, Jakub Narebski

On 07/15/2011 11:19 AM, Michael Haggerty wrote:
> On 07/14/2011 11:24 AM, Michael Haggerty wrote:
>> On 07/14/2011 09:16 AM, Michael Haggerty wrote:
>>> I have noticed that "git filter-branch" gets pathologically slow when it
>>> operates on a repository that has many references in a complicated
>>> directory hierarchy.  The time seems to go like O(N^3), where N is the
>>> number of references being rewritten.
> [...]
> A many possible improvements come to mind, in increasing order of
> intrusiveness and generality:
> [...]
> 5. Organize the loose refs cache in memory as a tree, and only populate
> the parts of it that are accessed.  This should also speed up iteration
> through a subtree by avoiding a linear search through all loose references.

FYI: I am working on (5), namely storing a linked list of loose refs for
each directory and only populating those directories that are accessed.
 The directories themselves will be held in a tree/trie (AFAICT the
distinction is primarily whether each node holds its whole key or only
the part of the key relative to its parent, which is an implementation
detail).  As a bonus, the caches for submodules will be handled
correctly (they are currently never used).

It might be another week or so before I have patches ready.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Strange O(N^3) behavior in "git filter-branch"
  2011-08-03 13:33     ` Michael Haggerty
@ 2011-08-03 19:37       ` Jeff King
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2011-08-03 19:37 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Junio C Hamano, Drew Northup, Jakub Narebski

On Wed, Aug 03, 2011 at 03:33:39PM +0200, Michael Haggerty wrote:

> On 07/15/2011 11:19 AM, Michael Haggerty wrote:
> > On 07/14/2011 11:24 AM, Michael Haggerty wrote:
> >> On 07/14/2011 09:16 AM, Michael Haggerty wrote:
> >>> I have noticed that "git filter-branch" gets pathologically slow when it
> >>> operates on a repository that has many references in a complicated
> >>> directory hierarchy.  The time seems to go like O(N^3), where N is the
> >>> number of references being rewritten.
> > [...]
> > A many possible improvements come to mind, in increasing order of
> > intrusiveness and generality:
> > [...]
> > 5. Organize the loose refs cache in memory as a tree, and only populate
> > the parts of it that are accessed.  This should also speed up iteration
> > through a subtree by avoiding a linear search through all loose references.
> 
> FYI: I am working on (5), namely storing a linked list of loose refs for
> each directory and only populating those directories that are accessed.
>  The directories themselves will be held in a tree/trie (AFAICT the
> distinction is primarily whether each node holds its whole key or only
> the part of the key relative to its parent, which is an implementation
> detail).  As a bonus, the caches for submodules will be handled
> correctly (they are currently never used).
> 
> It might be another week or so before I have patches ready.

Great. That is exactly the solution I was going to pursue, as well, but
I didn't actually start on it yet. I look forward to seeing your
patches.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2011-08-03 19:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-14  7:16 Strange O(N^3) behavior in "git filter-branch" Michael Haggerty
2011-07-14  9:24 ` Michael Haggerty
2011-07-15  9:19   ` Michael Haggerty
2011-07-15 18:51     ` Junio C Hamano
2011-07-15 21:20       ` Jeff King
2011-07-16  5:26         ` Michael Haggerty
2011-07-17 13:24           ` Drew Northup
2011-07-18  8:59             ` Jakub Narebski
2011-07-18 16:01               ` Drew Northup
2011-08-03 13:33     ` Michael Haggerty
2011-08-03 19:37       ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).