git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: git@vger.kernel.org
Subject: Re: Strange O(N^3) behavior in "git filter-branch"
Date: Thu, 14 Jul 2011 11:24:57 +0200	[thread overview]
Message-ID: <4E1EB5E9.1070902@alum.mit.edu> (raw)
In-Reply-To: <4E1E97C3.3030306@alum.mit.edu>

[-- Attachment #1: Type: text/plain, Size: 1161 bytes --]

On 07/14/2011 09:16 AM, Michael Haggerty wrote:
> I have noticed that "git filter-branch" gets pathologically slow when it
> operates on a repository that has many references in a complicated
> directory hierarchy.  The time seems to go like O(N^3), where N is the
> number of references being rewritten.

Correction and new information...

Correction: The test script that I attached to the last email was
configured incorrectly.  [Note to self: thunderbird reads attached files
at the moment the email is *sent*, not the moment that the attachment is
added in the compose window.]  The corrected script is attached.

New information: Once I have "primed" a git repository using the
attached script, a command like the following takes about 90 ms:

    git rev-parse refs/heads/a4/b1/c0414^0

strace reveals that it is calling stat64() then lstat64() on every
directory and every file under .git/refs.  By contrast,

    git rev-parse refs/heads/a4/b1/c0414

goes more or less straight to the file
.git/refs/tags/refs/heads/a4/b1/c0414, and finishes in a few milliseconds.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

[-- Attachment #2: filter-time --]
[-- Type: text/plain, Size: 2848 bytes --]

#! /usr/bin/python

import sys
import os
import subprocess
import shutil
import time


GIT_REPO = '/tmp/git-time-test'
FILENAME1 = os.path.join(GIT_REPO, 'a.txt')
FILENAME2 = os.path.join(GIT_REPO, 'b.txt')


N = 1000

#REF_STYLE = 'flat'
#REF_STYLE = 'subdir'
REF_STYLE = 'shard'

ACTION = 'filter'
#ACTION = 'add_refs'
#ACTION = 'move_refs'

PACK_REFS = True


def run(*args, **kw):
    if True:
        sys.stderr.write('Running %s\n' % (repr(args),))
    subprocess.check_call(*args, cwd=GIT_REPO, **kw)


def get_refnames(*patterns):
    cmd = ['git', 'for-each-ref', '--format=%(refname)'] + list(patterns)
    if True:
        sys.stderr.write('Running %s\n' % (repr(cmd),))
    p = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        cwd=GIT_REPO,
        )
    (out, err) = p.communicate()
    return [l.strip() for l in out.splitlines()]


def get_name(i):
    name = '%04d' % (i,)
    if REF_STYLE == 'flat':
        return 'refs/heads/b%s' % (name,)
    elif REF_STYLE == 'subdir':
        return 'refs/heads/a/b/c%s' % (name,)
    elif REF_STYLE == 'shard':
        return 'refs/heads/a%s/b%s/c%s' % (name[-1], name[-2], name,)


def make_repo():
    if os.path.exists(GIT_REPO):
        shutil.rmtree(GIT_REPO)

    subprocess.check_call(['git', 'init', GIT_REPO])

    run(['git', 'config', 'user.name', 'Lou User'])
    run(['git', 'config', 'user.email', 'luser@example.com'])

    open(FILENAME1, 'w')
    run(['git', 'add', '-N', FILENAME1])

    for i in range(N):
        open(FILENAME1, 'w').write('%d\n' % (i,))
        run(['git', 'commit', '-a', '-m', 'Commit %d' % (i,)])
        if ACTION != 'filter':
            run(['git', 'update-ref', 'refs/tags/%04d' % (i,), 'HEAD'])
        run(['git', 'update-ref', get_name(i), 'HEAD'])


def pack_refs():
    run(['git', 'pack-refs', '--all', '--prune'])


def filter():
    run(['git', 'filter-branch', '--tree-filter', 'cp a.txt b.txt', '--', '--all'])


def add_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(i)
        run(['git', 'update-ref', '-m', 'add ref', new_refname, refname])


def move_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(N - 1 - i)
        run(['git', 'update-ref', '-m', 'move ref', new_refname, refname])


def main(args):
    make_repo()

    if PACK_REFS:
        pack_refs()

    t = time.time()
    if ACTION == 'filter':
        filter()
    elif ACTION == 'add_refs':
        add_refs()
    elif ACTION == 'move_refs':
        move_refs()
    print 'time to process: %.3f s' % (time.time() - t,)


main(sys.argv[1:])


  reply	other threads:[~2011-07-14  9:25 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-14  7:16 Strange O(N^3) behavior in "git filter-branch" Michael Haggerty
2011-07-14  9:24 ` Michael Haggerty [this message]
2011-07-15  9:19   ` Michael Haggerty
2011-07-15 18:51     ` Junio C Hamano
2011-07-15 21:20       ` Jeff King
2011-07-16  5:26         ` Michael Haggerty
2011-07-17 13:24           ` Drew Northup
2011-07-18  8:59             ` Jakub Narebski
2011-07-18 16:01               ` Drew Northup
2011-08-03 13:33     ` Michael Haggerty
2011-08-03 19:37       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E1EB5E9.1070902@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).