Mercurial on BigTable

git.vger.kernel.org archive mirror

* Mercurial on BigTable
@ 2009-06-10 19:15 Scott Chacon
  2009-06-10 19:23 ` Sverre Rabbelier
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Scott Chacon @ 2009-06-10 19:15 UTC
  To: git list

Has anyone watched this yet?

http://code.google.com/events/io/sessions/MercurialBigTable.html

It's kind of interesting - a Googler talks about getting Mercurial
running on BigTable.  What fascinates me is that if I'm not horribly
mistaken, it seems like they just threw out the revlog format entirely
and just store the data in a key-value store as sort of a Git-like
content addressable filesystem.  I had thought they were taking
advantage of the revlog structure somehow, but it appears like they
basically just changed the underlying data format to be much more like
Git and rewrote an Hg-speaking server on top of that.  They even
explicitly store the head values like refs instead of reading
childless nodes out of the revlog, which is what I thought Hg did.

Does anyone know how they do the graph walking efficiently with this
structure?  He mentioned it was about half as fast as native Hg, but
that seemed to be acceptable.  Curious if anyone had any thoughts or
information on this.  Shawn, are there technical reasons why this
works well the way they're doing it for Hg but would not for Git (like
in the repo MINA based server)?  It looks like the data structure and
protocol exchange are incredibly similar after they threw away all the
revlog stuff.  Or is it just that they're fine with the speed loss and
the Android project would not be?

Scott


* Re: Mercurial on BigTable
  2009-06-10 19:15 Mercurial on BigTable Scott Chacon
@ 2009-06-10 19:23 ` Sverre Rabbelier
  2009-06-11  2:02 ` Andreas Ericsson
  2009-06-12  4:14 ` Shawn O. Pearce
  2 siblings, 0 replies; 8+ messages in thread
From: Sverre Rabbelier @ 2009-06-10 19:23 UTC
  To: Scott Chacon; +Cc: git list

Heya,

On Wed, Jun 10, 2009 at 21:15, Scott Chacon<schacon@gmail.com> wrote:
> in the repo MINA based server)?  It looks like the data structure and
> protocol exchange are incredibly similar after they threw away all the
> revlog stuff.  Or is it just that they're fine with the speed loss and
> the Android project would not be?

There's one explanation for that, git's lack of a decent HTTP
protocol, no? The speed loss of http:// vs git:// is still huge :(.

-- 
Cheers,

Sverre Rabbelier


* Re: Mercurial on BigTable
  2009-06-10 19:15 Mercurial on BigTable Scott Chacon
  2009-06-10 19:23 ` Sverre Rabbelier
@ 2009-06-11  2:02 ` Andreas Ericsson
  2009-06-11  8:24   ` Jakub Narebski
  2009-06-11 14:37   ` Sitaram Chamarty
  2009-06-12  4:14 ` Shawn O. Pearce
  2 siblings, 2 replies; 8+ messages in thread
From: Andreas Ericsson @ 2009-06-11  2:02 UTC
  To: Scott Chacon; +Cc: git list

Scott Chacon wrote:
> Has anyone watched this yet?
> 
> http://code.google.com/events/io/sessions/MercurialBigTable.html
> 
> It's kind of interesting - a Googler talks about getting Mercurial
> running on BigTable.  What fascinates me is that if I'm not horribly
> mistaken, it seems like they just threw out the revlog format entirely
> and just store the data in a key-value store as sort of a Git-like
> content addressable filesystem.

It does indeed seem like that, yes. Would have been fun to be there to
congratulate him on implementing something that's already existed for
about three years ;-)

>  I had thought they were taking
> advantage of the revlog structure somehow, but it appears like they
> basically just changed the underlying data format to be much more like
> Git and rewrote an Hg-speaking server on top of that.  They even
> explicitly store the head values like refs instead of reading
> childless nodes out of the revlog, which is what I thought Hg did.
> 

Well, storing the head values as refs is the only thing that makes
sense if you're using a database to track things, since you'd otherwise
have to map in too much data to get any sort of performance at all
out of it.
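
A minimal sketch of the difference described above, in Python. The
dictionaries PARENTS_OF and REFS and both function names are hypothetical
stand-ins, not anything from Mercurial or from the talk; the point is only
that deriving heads means reading every changeset, while a ref table is a
small direct lookup.

```python
def heads_by_scan(parents_of):
    """Derive heads the expensive way: read every changeset and keep
    those that never appear as anyone's parent."""
    has_child = {p for parents in parents_of.values() for p in parents}
    return sorted(c for c in parents_of if c not in has_child)


def heads_by_ref(refs):
    """Read heads from a small, explicitly maintained ref table."""
    return sorted(set(refs.values()))


# Toy history: a <- b <- d and a <- c <- e, so the heads are d and e.
PARENTS_OF = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]}
# Hypothetical ref table that the server updates on every push.
REFS = {"default": "d", "feature": "e"}

print(heads_by_scan(PARENTS_OF))
print(heads_by_ref(REFS))
```

Both calls give the same answer here, but the first touches all five
changesets and the second touches two ref entries, which is what matters
when every read is a database lookup.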

> Does anyone know how they do the graph walking efficiently with this
> structure?  He mentioned it was about half as fast as native Hg, but
> that seemed to be acceptable.

Yes, so they don't. DAG walking means they have to look up several
changesets in a linear fashion, but if they don't know the order
up front they'll have to suffer the penalty of actually fetching
each commit from the bigtable database over the network. It would
be similar to storing git objects in a database on a different
host, which would also be quite a lot slower than just hitting an
mmap()'ed file in binary form.
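
To make that cost concrete, here is a rough sketch in Python. RemoteStore
is a hypothetical stand-in for a networked key-value store (it is not a
BigTable client API); it only counts how many fetches a walk needs. Because
a commit's parents are not known until the commit itself has been read,
each commit costs one dependent round trip.

```python
from collections import deque


class RemoteStore:
    """Hypothetical networked store: every fetch counts as one round trip."""

    def __init__(self, commits):
        self._commits = commits  # commit id -> list of parent ids
        self.round_trips = 0

    def fetch(self, commit_id):
        self.round_trips += 1
        return self._commits[commit_id]


def walk_ancestors(store, head):
    """Breadth-first walk from head, fetching each commit exactly once."""
    seen = set()
    order = []
    queue = deque([head])
    while queue:
        cid = queue.popleft()
        if cid in seen:
            continue
        seen.add(cid)
        parents = store.fetch(cid)  # parents are unknown until this returns
        order.append(cid)
        queue.extend(parents)
    return order


# Toy history: d merges b and c, which both descend from a.
COMMITS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
store = RemoteStore(COMMITS)
history = walk_ancestors(store, "d")
print(history, store.round_trips)
```

With a local mmap()'ed file each of those fetches is a memory access; with
a remote store each one pays network latency, so the total grows with the
number of commits visited.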

>  Curious if anyone had any thoughts or
> information on this.  Shawn, are there technical reasons why this
> works well the way they're doing it for Hg but would not for Git (like
> in the repo MINA based server)?  It looks like the data structure and
> protocol exchange are incredibly similar after they threw away all the
> revlog stuff.  Or is it just that they're fine with the speed loss and
> the Android project would not be?
> 

I'm more curious as to why they didn't choose git. The only explanation
that was actually true is that hg works well over HTTP (if you can call
3 network requests per not-up-to-date head "well"). Since I can't imagine
them not doing proper research before launching a project that almost
certainly cost quite a lot of money, and I personally think that the
"http rules all" explanation sounded weak, I'm guessing there were other
reasons as to why they didn't go with git instead, and I'm fairly curious
to hear them. If I was to take a guess, I'd say git is written in a pretty
unfriendly way for implementing other storage engines.

Ah well. In a year or two they'll probably support git as well. One can
hope at least ;-)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: Mercurial on BigTable
  2009-06-11  2:02 ` Andreas Ericsson
@ 2009-06-11  8:24   ` Jakub Narebski
  2009-06-12  3:46     ` Shawn O. Pearce
  2009-06-11 14:37   ` Sitaram Chamarty
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Narebski @ 2009-06-11  8:24 UTC
  To: Andreas Ericsson; +Cc: Scott Chacon, git list

Andreas Ericsson <ae@op5.se> writes:

> I'm more curious as to why they didn't choose git. The only explanation
> that was actually true is that hg works well over HTTP (if you can call
> 3 network requests per not-up-to-date head "well"). Since I can't imagine
> them not doing proper research before launching a project that almost
> certainly cost quite a lot of money, and I personally think that the
> "http rules all" explanation sounded weak, I'm guessing there were other
> reasons as to why they didn't go with git instead, and I'm fairly curious
> to hear them. If I was to take a guess, I'd say git is written in a pretty
> unfriendly way for implementing other storage engines.

Well, Google App Engine was in Python, so it follows that the crew
would have it easier understanding Mercurial code (which is written in
Python with parts in C for performance), and in moving it to BigTable.
Adding Java to Google App Engine is, as far as I know, fairly recent;
additionally, JGit (a Git implementation in Java) is not yet a full
implementation.

I don't know if Git would be easy to implement on BigTable, or
whether it would be better for performance to implement it
on top of the underlying Google File System (GFS) and Chubby Lock Service
_directly_...

Sidenote: citing the lack of good HTTP protocol support (there are some
numbers at the bottom of the comparison[1], but not enough detail to
satisfy) as a reason is especially strange now that there has been quite a
long discussion about designing git-over-HTTP (the "smart" HTTP protocol):
cleaning up warts in the git pack protocol, working around HTTP being
stateless, ensuring backward compatibility, ensuring that it works well
with HTTP caches...

But that is the problem with detailed research on a fast-moving
target. Good research takes time, and by the time you have finished it,
its results are already obsolete...

[1] http://code.google.com/p/support/wiki/DVCSAnalysis

> 
> Ah well. In a year or two they'll probably support git as well. One can
> hope at least ;-)

Let's hope so...

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Mercurial on BigTable
  2009-06-11  8:24   ` Jakub Narebski
@ 2009-06-12  3:46     ` Shawn O. Pearce
  2009-06-12  7:14       ` Jakub Narebski
  0 siblings, 1 reply; 8+ messages in thread
From: Shawn O. Pearce @ 2009-06-12  3:46 UTC
  To: Jakub Narebski; +Cc: Andreas Ericsson, Scott Chacon, git list

Jakub Narebski <jnareb@gmail.com> wrote:
> Andreas Ericsson <ae@op5.se> writes:
> 
> > I'm more curious as to why they didn't choose git. The only explanation
> > that was actually true is that hg works well over HTTP 
> 
> Well, Google App Engine was in Python, so it follows that the crew
> would have it easier understanding Mercurial code (which is written in
> Python with parts in C for performance), and in moving it to BigTable.

This has nothing to do with Google AppEngine.  GAE has CPU and
bandwidth limitations in place that make running a source code server
like Hg on it impossible.  E.g. the maximum size you could download
in a single HTTP request was 1 MB, now it's up to 10 MB (IIRC).
The Hg hosting runs in a different cluster than the GAE hosting does,
and they are managed by different teams.

> Adding Java to Google App Engine is, as far as I know, fairly recent;

True, yes, GAE Java support is fairly new.

> additionally, JGit (a Git implementation in Java) is not yet a full
> implementation.

JGit implements sufficient parts of Git to be a full server, and
could power a hosting site... indeed it powers Gerrit Code Review,
which some companies do use as their entire Git server solution,
rather than e.g. Gitosis.

-- 
Shawn.


* Re: Mercurial on BigTable
  2009-06-12  3:46     ` Shawn O. Pearce
@ 2009-06-12  7:14       ` Jakub Narebski
  0 siblings, 0 replies; 8+ messages in thread
From: Jakub Narebski @ 2009-06-12  7:14 UTC
  To: Shawn O. Pearce; +Cc: Andreas Ericsson, Scott Chacon, git list

On Fri, 12 June 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
> > Andreas Ericsson <ae@op5.se> writes:
> > 
> > > I'm more curious as to why they didn't choose git. The only explanation
> > > that was actually true is that hg works well over HTTP 
> > 
> > Well, Google App Engine was in Python, so it follows that the crew
> > would have it easier understanding Mercurial code (which is written in
> > Python with parts in C for performance), and in moving it to BigTable.
> 
> This has nothing to do with Google AppEngine.  GAE has CPU and
> bandwidth limitations in place that make running a source code server
> like Hg on it impossible.  E.g. the maximum size you could download
> in a single HTTP request was 1 MB, now it's up to 10 MB (IIRC).
> The Hg hosting runs in a different cluster than the GAE hosting does,
> and they are managed by different teams.
> 
> > Adding Java to Google App Engine is, as far as I know, fairly recent;
> 
> True, yes, GAE Java support is fairly new.

I didn't mean to say that support for Mercurial (or another DVCS) in
Google Code had anything to do with Google AppEngine.  Rather, I wanted
to imply that what might have mattered when choosing between Mercurial
and Git was the fact that there was a large pool of people who are
proficient in Python (and later in Java) _and_ with web application(s).

But I don't quite see how the lack of good HTTP support should matter;
if they are rewriting Mercurial[1] (and earlier Subversion) to use
BigTable, couldn't they add "smart" HTTP protocol support to Git
(which I think would be easier)?

[1] Mercurial is not AGPLv3 licensed...

> 
> > additionally, JGit (a Git implementation in Java) is not yet a full
> > implementation.
> 
> JGit implements sufficient parts of Git to be a full server, and
> could power a hosting site... indeed it powers Gerrit Code Review,
> which some companies do use as their entire Git server solution,
> rather than e.g. Gitosis.

By "not full implementation" I meant that, as far as I know, JGit
doesn't yet have support for _creating_ (as opposed to simply reusing)
deltas in packfiles.

P.S. On the other hand, Mercurial's support for multiple branches is
not so good[2], and the way tags are implemented seems iffy to me...

[2] http://schacon.github.com/2008/11/24/on-mercurial.html

-- 
Jakub Narebski
Poland


* Re: Mercurial on BigTable
  2009-06-11  2:02 ` Andreas Ericsson
  2009-06-11  8:24   ` Jakub Narebski
@ 2009-06-11 14:37   ` Sitaram Chamarty
  1 sibling, 0 replies; 8+ messages in thread
From: Sitaram Chamarty @ 2009-06-11 14:37 UTC
  To: git

On 2009-06-11 02:02:45, Andreas Ericsson <ae@op5.se> wrote:
> "http rules all" explanation sounded weak, I'm guessing there were other
> reasons as to why they didn't go with git instead, and I'm fairly curious
> to hear them. If I was to take a guess, I'd say git is written in a pretty
> unfriendly way for implementing other storage engines.

err, umm...

http://permalink.gmane.org/gmane.comp.version-control.git/117588


* Re: Mercurial on BigTable
  2009-06-10 19:15 Mercurial on BigTable Scott Chacon
  2009-06-10 19:23 ` Sverre Rabbelier
  2009-06-11  2:02 ` Andreas Ericsson
@ 2009-06-12  4:14 ` Shawn O. Pearce
  2 siblings, 0 replies; 8+ messages in thread
From: Shawn O. Pearce @ 2009-06-12  4:14 UTC
  To: Scott Chacon; +Cc: git list

Scott Chacon <schacon@gmail.com> wrote:
> Has anyone watched this yet?
> 
> http://code.google.com/events/io/sessions/MercurialBigTable.html

I hadn't seen that yet, thanks.

> It's kind of interesting - a Googler talks about getting Mercurial
> running on BigTable.  What fascinates me is that if I'm not horribly
> mistaken, it seems like they just threw out the revlog format entirely
> and just store the data in a key-value store as sort of a Git-like
> content addressable filesystem.

Almost... but not quite.  If you look at the way they store files,
they embed the file path as part of the BigTable key.  This makes it
cheap to return all revisions between X and Y for any given file, as
it's just a range scan over the keys.  Git doesn't do this normally.
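
A small Python sketch of why a path-prefixed key makes that query cheap.
SortedTable and file_key are hypothetical: they imitate a store that keeps
keys in sorted order, and the (path, revision) key layout is an assumption
for illustration, not the actual schema from the talk.

```python
import bisect


class SortedTable:
    """Toy sorted key-value table supporting inclusive range scans."""

    def __init__(self):
        self._keys = []    # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key <= end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def file_key(path, rev):
    """Assumed key layout: path first, so one file's revisions are adjacent."""
    return (path, rev)


table = SortedTable()
for rev in (1, 2, 3, 5):
    table.put(file_key("src/a.c", rev), "a.c at rev %d" % rev)
for rev in (2, 3):
    table.put(file_key("src/b.c", rev), "b.c at rev %d" % rev)

# All revisions of src/a.c between 2 and 4: one contiguous slice of keys.
for key, value in table.scan(file_key("src/a.c", 2), file_key("src/a.c", 4)):
    print(key, value)
```

A content-addressed layout keyed only by object hash has no such adjacency,
so the same query would need one lookup per revision.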

In Hg, and in their implementation of it on BigTable, if a file
content is copied between two paths (same blob in git terms) they
actually duplicate the data, once under each path.  We could do
something like that in Git... and just pay the price on copy, and
then you can get a storage layout like they do, and have it scale
well onto a larger system.  But... pack size will suffer in what
the client receives, it will be bigger.
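
The storage trade-off can be sketched in a few lines of Python. Both
functions are hypothetical illustrations: content_addressed hashes the raw
content only (real Git also hashes a small header, omitted here), and
path_keyed assumes a (path, revision) key as a stand-in for the path-based
layout described above.

```python
import hashlib


def content_addressed(files):
    """Key each blob by a hash of its content: identical content is stored once."""
    store = {}
    for data in files.values():
        store[hashlib.sha1(data).hexdigest()] = data
    return store


def path_keyed(files, rev=0):
    """Key each blob by (path, rev): a copied file is stored once per path."""
    return {(path, rev): data for path, data in files.items()}


# The same content under two paths, as after a file copy.
FILES = {"a.txt": b"hello\n", "copy_of_a.txt": b"hello\n"}

print(len(content_addressed(FILES)))
print(len(path_keyed(FILES)))
```

The first layout holds one object and the second holds two, which is the
price on copy mentioned above.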

> Does anyone know how they do the graph walking efficiently with this
> structure?  He mentioned it was about half as fast as native Hg, but
> that seemed to be acceptable.  Curious if anyone had any thoughts or
> information on this.  Shawn, are there technical reasons why this
> works well the way they're doing it for Hg but would not for Git (like
> in the repo MINA based server)?  It looks like the data structure and
> protocol exchange are incredibly similar after they threw away all the
> revlog stuff.

I think they also added more pointers and data caches that don't
exist in Hg normally, but exist in their BigTable backend.  Like
precomputing pointers from a commit to the most recent ancestor
that is a merge; I think that was mentioned in the talk.
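
One way such a pointer could be precomputed is sketched below in Python.
This is a guess at the general idea, not the scheme from the talk: the
function name, the first-parent rule, and the requirement that commits
arrive parents-first are all assumptions made for illustration.

```python
def precompute_merge_pointers(parents_of, topo_order):
    """For each commit, record the nearest merge commit reachable by
    following first parents. topo_order must list parents before children."""
    nearest_merge = {}
    for cid in topo_order:
        parents = parents_of[cid]
        if not parents:
            nearest_merge[cid] = None          # root: no ancestors at all
            continue
        first = parents[0]
        if len(parents_of[first]) > 1:
            nearest_merge[cid] = first         # first parent is itself a merge
        else:
            nearest_merge[cid] = nearest_merge[first]  # reuse parent's pointer
    return nearest_merge


# Toy history: m merges b and c; d and e follow m on a linear chain.
PARENTS = {
    "a": [], "b": ["a"], "c": ["a"],
    "m": ["b", "c"], "d": ["m"], "e": ["d"],
}
pointers = precompute_merge_pointers(PARENTS, ["a", "b", "c", "m", "d", "e"])
print(pointers["e"], pointers["d"], pointers["m"])
```

With the pointer stored next to each commit, a walker can jump over a long
linear run in one lookup instead of fetching every commit in between.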

The JGit/MINA based servers run git "well enough", but that's off
local disk, and we do pay a good price compared to C Git.  E.g.
we really need a revcache to accelerate the object enumeration phase,
that takes ages in JGit.  And indexing a pushed pack is rather slow
compared to C Git, a large push could take up to a minute or two
to fully index and fsck.

> Or is it just that they're fine with the speed loss and
> the Android project would not be?

What does Android have to do with Hg?  Android went with Git for
a lot of reasons, none of them having to do with the performance
or availability of Hg on code.google.com.  All of them had to do
with Git being a really solid DVCS that has a very bright future.

-- 
Shawn.
