From: Johan Herland <johan@herland.net>
To: git@vger.kernel.org
Cc: "Kristian Høgsberg" <krh@redhat.com>
Subject: [RFC] git-clone should create packed refs
Date: Fri, 15 Feb 2008 01:33:19 +0100
Message-ID: <200802150133.19247.johan@herland.net>
Hi,
I'm experimenting with converting deep (lots of history) CVS repos to Git,
and I notice that cloning the resulting Git repos is _slow_. E.g. an
example repo with 10000 tags and 1000 branches will take ~24 seconds to
clone. Debugging shows that >95% of that time is spent calling "git
update-ref" for each of the 11000 refs. I can easily get the total runtime
down to ~4 seconds by replacing the "git update-ref ..." with something
like "echo $sha1 $destname >> $GIT_DIR/packed-refs". Some more
investigation shows that what's actually taking so long is not writing all
these 40-byte ref files and their corresponding reflogs, but rather the
overhead of creating the "git update-ref" process 11000 times (echo is a
shell builtin, I presume, so doesn't have the same overhead). My conclusion
is therefore that making "git clone" a builtin will solve my performance
problems (since the update-ref is now a function call, rather than a
subprocess).
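
To illustrate the difference, here is a minimal sketch of the "append with a
shell builtin" approach. It is not taken from git-clone.sh:
write_packed_refs is a hypothetical helper, the input format of one
"<sha1> <refname>" pair per line is an assumption, and the SHA-1 values are
made up. A real implementation would also have to respect whatever header
and ordering git expects in packed-refs, which is not shown here.

```shell
#!/bin/sh
# Sketch only: record many refs using shell builtins, instead of
# spawning one "git update-ref" process per ref.
# Assumption: write_packed_refs is a hypothetical helper, not part of git.
write_packed_refs() {
	# $1: file with "<sha1> <refname>" lines (assumed format)
	# $2: packed-refs file to append to
	while read -r sha1 destname
	do
		# echo is a builtin in common shells, so no process is
		# created per ref
		echo "$sha1 $destname"
	done <"$1" >>"$2"
}

# Demo with made-up data in a scratch directory (not a real repository).
demo_dir=$(mktemp -d)
printf '%s\n' \
	'1111111111111111111111111111111111111111 refs/heads/master' \
	'2222222222222222222222222222222222222222 refs/tags/v1.0' \
	>"$demo_dir/refs.txt"
write_packed_refs "$demo_dir/refs.txt" "$demo_dir/packed-refs"

# Count the lines written; tr strips padding some wc implementations add.
packed_count=$(wc -l <"$demo_dir/packed-refs" | tr -d ' ')
echo "wrote $packed_count refs to $demo_dir/packed-refs"
```

The whole loop runs inside a single shell process and opens the output file
once, which is the property that matters for the timings quoted above.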
Searching the list, I find that - lo and behold - someone (CCed) is actually
already working on this. :)
(BTW, a progress report on this work would be nice...)
So the only niggle I have left is that when git-clone is cloning repos with
thousands of refs, it makes sense to create a packed-refs file directly in
the clone, instead of having to run "git pack-refs" (or "git gc")
afterwards to (re)pack the refs. This has pretty much the same reasoning as
transferring and storing the objects in packs instead of exploding them
into loose objects.
In my case, the upstream repo already has packed refs, so it just seems
stupid to explode them into "loose" refs when cloning, and make me re-pack
them afterwards.
Looking at git-clone.sh, I even find that when cloning, the refs are
transferred in a format similar (but not identical) to the packed-refs file
format (see CLONE_HEAD in git-clone.sh).
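
As a rough sketch of how small the gap between the two formats is, the
following converts a ref listing into packed-refs form. The input layout is
an assumption on my part (ls-remote style: "<sha1> <refname>" separated by
whitespace, with the peeled object of an annotated tag listed as
"<refname>^{}" directly after the tag); I have not checked it against what
CLONE_HEAD actually contains. to_packed_refs is a hypothetical name and the
SHA-1 values are placeholders.

```shell
#!/bin/sh
# Sketch only: turn an ls-remote-style ref listing (assumed input format)
# into packed-refs lines. to_packed_refs is a hypothetical helper.
to_packed_refs() {
	echo '# pack-refs with: peeled'
	while read -r sha1 name
	do
		case "$name" in
		*'^{}')
			# Peeled tag entry: packed-refs stores it as
			# "^<sha1>" on the line after the tag itself.
			echo "^$sha1" ;;
		*)
			echo "$sha1 $name" ;;
		esac
	done
}

# Placeholder object names, 40 hex digits each.
sha_a=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
sha_b=bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
sha_c=cccccccccccccccccccccccccccccccccccccccc

# A branch, an annotated tag, and the tag's peeled commit.
listing=$(printf '%s\t%s\n' \
	"$sha_a" refs/heads/master \
	"$sha_b" refs/tags/v1.0 \
	"$sha_c" 'refs/tags/v1.0^{}')

packed=$(printf '%s\n' "$listing" | to_packed_refs)
printf '%s\n' "$packed"
```

Under those assumptions the conversion is a single pass over the listing,
with no per-ref process.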
AFAICS, the only complication with this proposal is how to deal with the
reflogs. Right now, for each ref created, a corresponding reflog with a
single entry is written. Therefore - in my example repo above - the
current "git clone" writes ~22000 files, and my proposal offers only a net
reduction in #files written by ~50%, instead of ~100%. For reference, the
reflog entries written by "git clone" look like this:
"000... $sha1 A U Thor <e@mail> $timestamp clone: from $repo"
IMHO, these entries don't carry much value:
- The $sha1 is self-evident (and if later changed, will still be mentioned
in the next reflog entry).
- The author name and email would probably be self-evident/uninteresting in
most cases.
- The timestamp might be marginally useful, as I can't immediately think of
another way of getting the time of cloning.
- The $repo would also be self-evident in many cases, and would in any case
also be listed in the config file in the "origin" remote section.
I'd therefore suggest making reflog creation in "git clone" optional, in
order to avoid having the number of files written be proportional to the
number of refs.
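
For the example repo, the file counts discussed above work out as follows.
This is plain arithmetic on the numbers already given (10000 tags, 1000
branches); the variable names are only for illustration, and files that do
not scale with the number of refs are ignored.

```shell
#!/bin/sh
# Back-of-the-envelope count of ref-related files written by a clone.
tags=10000
branches=1000
refs=$((tags + branches))

# Current behaviour: one loose ref file plus one reflog per ref.
loose_files=$((refs * 2))

# Packed refs, reflogs kept: one reflog per ref plus one packed-refs file.
packed_files=$((refs + 1))

# Packed refs, reflog creation skipped: just the packed-refs file.
packed_no_reflog_files=1

# Percentage of files saved relative to the current behaviour, rounded
# to the nearest integer.
saved_pct=$(( ((loose_files - packed_files) * 100 + loose_files / 2) / loose_files ))
saved_no_reflog_pct=$(( ((loose_files - packed_no_reflog_files) * 100 + loose_files / 2) / loose_files ))

echo "loose refs + reflogs:  $loose_files files"
echo "packed refs + reflogs: $packed_files files (~$saved_pct% fewer)"
echo "packed refs only:      $packed_no_reflog_files file (~$saved_no_reflog_pct% fewer)"
```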
I would imagine that even though the time used on Linux for writing
thousands of files might be negligible, this is not the case on certain
other OSes...
Have fun! :)
...Johan
--
Johan Herland, <johan@herland.net>
www.herland.net