From: Felipe Contreras <felipe.contreras@gmail.com>
To: John Szakmeister <john@szakmeister.net>
Cc: git@vger.kernel.org
Subject: Re: Is there a way to speed up remote-hg?
Date: Sat, 20 Apr 2013 18:07:41 -0500
Message-ID: <CAMP44s3Rh6Ef8aSif39UXgr5tqBCCMj92vo_MFoDOupu3Xj8Hw@mail.gmail.com>
In-Reply-To: <CAEBDL5XO4oU9QL1=kQ_f8_MM9jHAKQojMQr_6VSZsEYNY7PLpA@mail.gmail.com>
On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister <john@szakmeister.net> wrote:
> I really like the idea of remote-hg, but it appears to be awfully slow
> on the clone step:
The short answer is no. I do have a couple of patches that improve
performance, but not by a huge factor.
I have profiled the code, and there are two significant places where
performance is wasted:
1) Fetching the file contents
Extracting, decompressing, transferring, and then compressing and
storing the file contents is mostly unavoidable, unless we already
have the contents of that file. In Git that would be easy to check
by comparing the checksum (SHA-1). Unfortunately Mercurial doesn't
have that information. The SHA-1 it stores is not of the contents
alone, but of the contents plus the parent checksums, which means that
if you revert a modification you made to a file, or move a file, or do
any operation that ends up with the same contents, but from a
different path, the SHA-1 is different. This means the only way to
know if the contents are the same is to extract them and calculate the
SHA-1 yourself, which defeats the purpose of the calculation in the
first place.
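To make the difference concrete, here is a minimal sketch of the two
hashing schemes. It is not code from remote-hg. The function names are
made up for illustration, and the Mercurial side is a simplification
that assumes a node id is the SHA-1 of the two sorted parent ids
followed by the text, ignoring any copy metadata.

```python
import hashlib

NULL_ID = b"\0" * 20  # Mercurial's "no parent" node id


def git_blob_id(data: bytes) -> str:
    """SHA-1 the way Git computes it for a blob: a header plus the contents."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def hg_node_id(data: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> str:
    """Simplified Mercurial-style node id: sorted parent ids, then the contents."""
    lo, hi = sorted((p1, p2))
    return hashlib.sha1(lo + hi + data).hexdigest()


contents = b"hello\n"

# The same contents recorded twice: once with no parent, once on top of
# some other revision of the file.
first = hg_node_id(contents)
other_parent = bytes.fromhex(hg_node_id(b"something else\n"))
second = hg_node_id(contents, p1=other_parent)

same_git = git_blob_id(contents) == git_blob_id(contents)  # contents-only hash
same_hg = first == second  # the hash also depends on history
```

Under these assumptions same_git is True and same_hg is False: the
Git id depends only on the contents, while the Mercurial-style id
changes with the parents, so it cannot be used to detect contents we
already have.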
I've tried both approaches: calculating the SHA-1 and using a previous
reference to avoid the transfer, or doing the transfer and letting Git
check for existing objects. It doesn't make a difference.
This is by Mercurial's stupid design, and there's nothing we, or
anybody else, can do about it until they change it.
2) Checking for file changes
For each commit (or revision), we need to figure out which files were
modified, and for that, Mercurial has a neat shortcut that stores such
modifications in the commit context itself, so it's easy to retrieve.
Unfortunately, it's sometimes wrong.
Since the Mercurial tools never use this information for any real
work, only to show the changes to the users, the Mercurial folks never
noticed the contents they were storing were wrong. This means that if
you have a repository that started with old versions of Mercurial,
chances are this information will be wrong, and there's no real
guarantee that future versions won't have this problem, since to this
day this information continues to be used only to display stuff to the
user.
So, since we cannot rely on this, we need to manually check for
differences the way Mercurial does, which blows performance away,
because you need to get the contents of the two parent revisions and
compare them. By contents I mean the manifest, or list of files, and
retrieving it takes a considerable amount of time.
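As a rough illustration of what such a manual check involves, here is
a minimal sketch that compares two manifests. It is not the actual
remote-hg code. The manifests are modeled as plain dicts mapping a
path to a file node id, the function name and the sample data are made
up, and file flags (such as the executable bit) are ignored.

```python
def changed_files(current: dict, parent: dict):
    """Compare two manifests (path -> file node id).

    Returns (modified, removed): paths that are new or whose node id
    differs from the parent, and paths that exist only in the parent.
    """
    modified = {path for path, node in current.items()
                if parent.get(path) != node}
    removed = set(parent) - set(current)
    return modified, removed


# Hypothetical manifests for a parent revision and its child.
parent = {"README": "a1", "main.c": "b2", "old.c": "c3"}
current = {"README": "a1", "main.c": "d4", "new.c": "e5"}

modified, removed = changed_files(current, parent)
```

The comparison itself is cheap. The cost is in obtaining the full
manifest of every revision and its parents, since each manifest lists
every file in the repository.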
For 1) there's nothing we can do, and for 2) we could trust the files
Mercurial thinks were modified, and that gives us a very significant
boost, but the repository will sometimes end up wrong. Most of the
time is spent on 2).
So unfortunately there's nothing we can do. That's just Mercurial's
design, and it really has nothing to do with Git. Any other tool would
have the same problems, even a tool that converts a Mercurial
repository to Mercurial (without using tricks).
It seems Bazaar is more sensible in this regard: 1) the checksums are
truly of the file contents, and 2) each revision does store the file
modifications correctly. So a clone in Bazaar is much faster. In my
opinion Mercurial just screwed up their design.
Cheers.
--
Felipe Contreras