From: Michael Haggerty <mhagger@alum.mit.edu>
To: git@vger.kernel.org
Subject: Re: Strange O(N^3) behavior in "git filter-branch"
Date: Thu, 14 Jul 2011 11:24:57 +0200 [thread overview]
Message-ID: <4E1EB5E9.1070902@alum.mit.edu> (raw)
In-Reply-To: <4E1E97C3.3030306@alum.mit.edu>
[-- Attachment #1: Type: text/plain, Size: 1161 bytes --]
On 07/14/2011 09:16 AM, Michael Haggerty wrote:
> I have noticed that "git filter-branch" gets pathologically slow when it
> operates on a repository that has many references in a complicated
> directory hierarchy. The time seems to go like O(N^3), where N is the
> number of references being rewritten.
Correction and new information...
Correction: The test script that I attached to the last email was
configured incorrectly. [Note to self: thunderbird reads attached files
at the moment the email is *sent*, not the moment that the attachment is
added in the compose window.] The corrected script is attached.
New information: Once I have "primed" a git repository using the
attached script, a command like the following takes about 90 ms:
git rev-parse refs/heads/a4/b1/c0414^0
strace reveals that it is calling stat64() then lstat64() on every
directory and every file under .git/refs. By contrast,
git rev-parse refs/heads/a4/b1/c0414
goes more or less straight to the file
.git/refs/tags/refs/heads/a4/b1/c0414, and finishes in a few milliseconds.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
[-- Attachment #2: filter-time --]
[-- Type: text/plain, Size: 2848 bytes --]
#! /usr/bin/python
import sys
import os
import subprocess
import shutil
import time
GIT_REPO = '/tmp/git-time-test'
FILENAME1 = os.path.join(GIT_REPO, 'a.txt')
FILENAME2 = os.path.join(GIT_REPO, 'b.txt')
N = 1000
#REF_STYLE = 'flat'
#REF_STYLE = 'subdir'
REF_STYLE = 'shard'
ACTION = 'filter'
#ACTION = 'add_refs'
#ACTION = 'move_refs'
PACK_REFS = True
def run(*args, **kw):
if True:
sys.stderr.write('Running %s\n' % (repr(args),))
subprocess.check_call(*args, cwd=GIT_REPO, **kw)
def get_refnames(*patterns):
cmd = ['git', 'for-each-ref', '--format=%(refname)'] + list(patterns)
if True:
sys.stderr.write('Running %s\n' % (repr(cmd),))
p = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
cwd=GIT_REPO,
)
(out, err) = p.communicate()
return [l.strip() for l in out.splitlines()]
def get_name(i):
name = '%04d' % (i,)
if REF_STYLE == 'flat':
return 'refs/heads/b%s' % (name,)
elif REF_STYLE == 'subdir':
return 'refs/heads/a/b/c%s' % (name,)
elif REF_STYLE == 'shard':
return 'refs/heads/a%s/b%s/c%s' % (name[-1], name[-2], name,)
def make_repo():
if os.path.exists(GIT_REPO):
shutil.rmtree(GIT_REPO)
subprocess.check_call(['git', 'init', GIT_REPO])
run(['git', 'config', 'user.name', 'Lou User'])
run(['git', 'config', 'user.email', 'luser@example.com'])
open(FILENAME1, 'w')
run(['git', 'add', '-N', FILENAME1])
for i in range(N):
open(FILENAME1, 'w').write('%d\n' % (i,))
run(['git', 'commit', '-a', '-m', 'Commit %d' % (i,)])
if ACTION != 'filter':
run(['git', 'update-ref', 'refs/tags/%04d' % (i,), 'HEAD'])
run(['git', 'update-ref', get_name(i), 'HEAD'])
def pack_refs():
run(['git', 'pack-refs', '--all', '--prune'])
def filter():
run(['git', 'filter-branch', '--tree-filter', 'cp a.txt b.txt', '--', '--all'])
def add_refs():
refnames = get_refnames('refs/tags')
for refname in refnames:
assert not refname.endswith('master')
i = int(refname.rsplit('/', 1)[-1])
new_refname = get_name(i)
run(['git', 'update-ref', '-m', 'add ref', new_refname, refname])
def move_refs():
refnames = get_refnames('refs/tags')
for refname in refnames:
assert not refname.endswith('master')
i = int(refname.rsplit('/', 1)[-1])
new_refname = get_name(N - 1 - i)
run(['git', 'update-ref', '-m', 'move ref', new_refname, refname])
def main(args):
make_repo()
if PACK_REFS:
pack_refs()
t = time.time()
if ACTION == 'filter':
filter()
elif ACTION == 'add_refs':
add_refs()
elif ACTION == 'move_refs':
move_refs()
print 'time to process: %.3f s' % (time.time() - t,)
main(sys.argv[1:])
next prev parent reply other threads:[~2011-07-14 9:25 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-07-14 7:16 Strange O(N^3) behavior in "git filter-branch" Michael Haggerty
2011-07-14 9:24 ` Michael Haggerty [this message]
2011-07-15 9:19 ` Michael Haggerty
2011-07-15 18:51 ` Junio C Hamano
2011-07-15 21:20 ` Jeff King
2011-07-16 5:26 ` Michael Haggerty
2011-07-17 13:24 ` Drew Northup
2011-07-18 8:59 ` Jakub Narebski
2011-07-18 16:01 ` Drew Northup
2011-08-03 13:33 ` Michael Haggerty
2011-08-03 19:37 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4E1EB5E9.1070902@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).