All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: git@vger.kernel.org
Subject: Re: Strange O(N^3) behavior in "git filter-branch"
Date: Thu, 14 Jul 2011 11:24:57 +0200	[thread overview]
Message-ID: <4E1EB5E9.1070902@alum.mit.edu> (raw)
In-Reply-To: <4E1E97C3.3030306@alum.mit.edu>

[-- Attachment #1: Type: text/plain, Size: 1161 bytes --]

On 07/14/2011 09:16 AM, Michael Haggerty wrote:
> I have noticed that "git filter-branch" gets pathologically slow when it
> operates on a repository that has many references in a complicated
> directory hierarchy.  The time seems to go like O(N^3), where N is the
> number of references being rewritten.

Correction and new information...

Correction: The test script that I attached to the last email was
configured incorrectly.  [Note to self: thunderbird reads attached files
at the moment the email is *sent*, not the moment that the attachment is
added in the compose window.]  The corrected script is attached.

New information: Once I have "primed" a git repository using the
attached script, a command like the following takes about 90 ms:

    git rev-parse refs/heads/a4/b1/c0414^0

strace reveals that it is calling stat64() then lstat64() on every
directory and every file under .git/refs.  By contrast,

    git rev-parse refs/heads/a4/b1/c0414

goes more or less straight to the file
.git/refs/tags/refs/heads/a4/b1/c0414, and finishes in a few milliseconds.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

[-- Attachment #2: filter-time --]
[-- Type: text/plain, Size: 2848 bytes --]

#! /usr/bin/python

import sys
import os
import subprocess
import shutil
import time


GIT_REPO = '/tmp/git-time-test'
FILENAME1 = os.path.join(GIT_REPO, 'a.txt')
FILENAME2 = os.path.join(GIT_REPO, 'b.txt')


N = 1000

#REF_STYLE = 'flat'
#REF_STYLE = 'subdir'
REF_STYLE = 'shard'

ACTION = 'filter'
#ACTION = 'add_refs'
#ACTION = 'move_refs'

PACK_REFS = True


def run(*args, **kw):
    if True:
        sys.stderr.write('Running %s\n' % (repr(args),))
    subprocess.check_call(*args, cwd=GIT_REPO, **kw)


def get_refnames(*patterns):
    cmd = ['git', 'for-each-ref', '--format=%(refname)'] + list(patterns)
    if True:
        sys.stderr.write('Running %s\n' % (repr(cmd),))
    p = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        cwd=GIT_REPO,
        )
    (out, err) = p.communicate()
    return [l.strip() for l in out.splitlines()]


def get_name(i):
    name = '%04d' % (i,)
    if REF_STYLE == 'flat':
        return 'refs/heads/b%s' % (name,)
    elif REF_STYLE == 'subdir':
        return 'refs/heads/a/b/c%s' % (name,)
    elif REF_STYLE == 'shard':
        return 'refs/heads/a%s/b%s/c%s' % (name[-1], name[-2], name,)


def make_repo():
    if os.path.exists(GIT_REPO):
        shutil.rmtree(GIT_REPO)

    subprocess.check_call(['git', 'init', GIT_REPO])

    run(['git', 'config', 'user.name', 'Lou User'])
    run(['git', 'config', 'user.email', 'luser@example.com'])

    open(FILENAME1, 'w')
    run(['git', 'add', '-N', FILENAME1])

    for i in range(N):
        open(FILENAME1, 'w').write('%d\n' % (i,))
        run(['git', 'commit', '-a', '-m', 'Commit %d' % (i,)])
        if ACTION != 'filter':
            run(['git', 'update-ref', 'refs/tags/%04d' % (i,), 'HEAD'])
        run(['git', 'update-ref', get_name(i), 'HEAD'])


def pack_refs():
    run(['git', 'pack-refs', '--all', '--prune'])


def filter():
    run(['git', 'filter-branch', '--tree-filter', 'cp a.txt b.txt', '--', '--all'])


def add_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(i)
        run(['git', 'update-ref', '-m', 'add ref', new_refname, refname])


def move_refs():
    refnames = get_refnames('refs/tags')

    for refname in refnames:
        assert not refname.endswith('master')
        i = int(refname.rsplit('/', 1)[-1])
        new_refname = get_name(N - 1 - i)
        run(['git', 'update-ref', '-m', 'move ref', new_refname, refname])


def main(args):
    make_repo()

    if PACK_REFS:
        pack_refs()

    t = time.time()
    if ACTION == 'filter':
        filter()
    elif ACTION == 'add_refs':
        add_refs()
    elif ACTION == 'move_refs':
        move_refs()
    print 'time to process: %.3f s' % (time.time() - t,)


main(sys.argv[1:])


  reply	other threads:[~2011-07-14  9:25 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-14  7:16 Strange O(N^3) behavior in "git filter-branch" Michael Haggerty
2011-07-14  9:24 ` Michael Haggerty [this message]
2011-07-15  9:19   ` Michael Haggerty
2011-07-15 18:51     ` Junio C Hamano
2011-07-15 21:20       ` Jeff King
2011-07-16  5:26         ` Michael Haggerty
2011-07-17 13:24           ` Drew Northup
2011-07-18  8:59             ` Jakub Narebski
2011-07-18 16:01               ` Drew Northup
2011-08-03 13:33     ` Michael Haggerty
2011-08-03 19:37       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E1EB5E9.1070902@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.