Git development
 help / color / mirror / Atom feed
* JGit server performance
@ 2009-03-08  3:22 Shawn O. Pearce
       [not found] ` <d411cc4a0903090801w7748d26pb821a7bfb3db660@mail.gmail.com>
  0 siblings, 1 reply; 2+ messages in thread
From: Shawn O. Pearce @ 2009-03-08  3:22 UTC (permalink / raw)
  To: git

As Gerrit Code Review provides Gitosis-like functionality, but
that is implemented through JGit's pure Java implementation, the
performance of JGit's UploadPack matters to me.

JGit is about 66% slower on my hardware when cloning the linux-2.6
repository, compared to git-core (jgit 2m21s, git-core 1m23s).

The bottlenecks are:

  ~41.2% in ObjectLoader.getCachedBytes()

    This is the tree objects being parsed out of the pack file.
    The problem here (I believe) is we have horrible locality
    when reading.  The delta base for a tree isn't in memory most
    of the time, because its been evicted by other trees accessed
    since the last time that tree was touched.

    Conceptually this makes some sense, as ObjectWalk does a depth
    first traversal through the tree of each commit, in most-recent
    to least-recent commit order.  On a larger project like the
    kernel we'll touch a lot more objects between two root trees,
    and there isn't even any guarantee that two root trees that
    appear near each other in the commit sequence have a delta
    base relationship.

  ~20.5% in AbstractTreeIterator.getEntryObjectId()

    The bulk of the time here is really down in NB.decodeInt32().
    We spend a lot of time converting an object id in a tree data
    stream into an AnyObjectId (really a reused MutableObjectId)
    so that we can probe the ObjectIdSubclassMap to see if we have
    seen this object before.

    The sad fact is, we need all 20 bytes to be converted into the 5
    words, because the majority of the time, we have actually seen
    the object before, and it exists in our hash table.  The only
    way to know is to convert and compare all 5 words.  Any attempt
    to lazily convert the 5 words would just make it slower.

... and it falls off from there.

I'm at a loss on how to improve the performance.  I don't think that
we can do anything about the 20% in getEntryObjectId() due to the
way our data structures are organized around the 5-word ObjectId,
and not a byte[].  That 20% is a penalty git-core doesn't have to
pay, and is most certainly one reason why JGit is so much slower.

The only thing that may work is to modify ObjectWalk to try and
deduce some delta-chain locality from the pack.  Buffer up objects
that it needs to parse in a queue, rank them by the delta base
they would need to use, and then try to unfold the base first,
and then the children.

That is, do something like what IndexPack does, where we try to
unpack each object exactly once, and recursively process the delta
chain children after unpacking the parent.

We _might_ get better locality if we can queue up all root trees,
process all of them, then process the first level children, etc,
so go breadth first.

But that seems like a lot of code, and it probably wrecks the simple
recency ordering produced natively by ObjectWalk.  :-\

-- 
Shawn.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: JGit server performance
       [not found] ` <d411cc4a0903090801w7748d26pb821a7bfb3db660@mail.gmail.com>
@ 2009-03-09 15:27   ` Shawn O. Pearce
  0 siblings, 0 replies; 2+ messages in thread
From: Shawn O. Pearce @ 2009-03-09 15:27 UTC (permalink / raw)
  To: Scott Chacon; +Cc: git

[+git]

Scott Chacon <schacon@gmail.com> wrote:
> On Sat, Mar 7, 2009 at 8:22 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> > As Gerrit Code Review provides Gitosis-like functionality, but
> > that is implemented through JGit's pure Java implementation, the
> > performance of JGit's UploadPack matters to me.
> >
> > JGit is about 66% slower on my hardware when cloning the linux-2.6
> > repository, compared to git-core (jgit 2m21s, git-core 1m23s).
> 
> What about packfile caching?  I thought I just saw someone propose
> this for a GSOC project - it obviously wouldn't help with making the
> first clone faster, but since it seems that you're talking about a
> server that is probably recomputing the same packfile bits over and
> over, perhaps caching the final output would be a simpler way to get
> more aggregate time savings.  I remember there were some issues with
> client capabilities and whatnot, but would that be a useful solution
> to speeding up overall throughput?

Indeed.

I think over the weekend I came to the conclusion that the only
way I can "fix" this is to implement some form of automatic caching
inside of JGit.

I came up with two ideas, but haven't tried either one yet:

Cache Object List:

  The idea here is that JGit spends the majority of its server side
  resources enumerating objects, doing the `git rev-list --objects
  $wants --not $haves` computation.  Once you reach a certain point
  on the DAG, that computation just isn't going to change much.

  So, for example, if we tag a dump of the object listing for all
  objects that are reachable from the most recently reachable
  annotated tag, and cache that dump, we can reload it when we
  encounter that tag again.  By restoring that dump, we can cut
  off a huge part of the object evaluation.

  Initial clones would always benefit from this dump, and we can
  expand it to cover more of the DAG as new tags are produced.

  The really recent commits still require enumeration, but this
  is a small percentage of the total objects being processed,
  and should go quite quickly.

  We still need to select and serve ranges of existing packs,
  based on the objects chosen for transmission.

Cache Packs:

  As you point out above, do the GSoC project idea (but in JGit)
  where we cache packs.

  My problem with this approach is it doesn't help initial clones
  very much once any one of the refs changes.  As soon as any
  ref is updated in any way, that cached pack is invalid and must
  be recomputed from scratch.

  However, it probably would generalize nicely into supporting a
  decent cache for incremental updates of clients.  In many cases
  in a Linus/Junio single-push-a-day world, all clients would be
  roughly asking for the same pack, and we'd get a decent return on
  our investment by saving the thin pack to disk in some cache area.

  But in a Gerrit Code Review world, its expected that the
  branches will update hundreds of times per day, and clients
  will be fetching off those multiple times throughout the day.
  So the life span of any given cached pack is probably quite small.

  However, I have no real proof to back this theory up with yet,
  its just back of the envelope guessing on my part.

Hybrid:

  I also considered some sort of hybrid approach.

  E.g. create a cache pack for everything reachable from a stable
  tag point (the object set in that pack is the same as the cached
  object list).  When sending data to a client, we send the current
  objects that aren't in the big cache pack, and then we tack the
  pack onto the end of the data stream.  So we only need to "read,
  SHA-1, send" through the entire length of the file (minus the 12
  byte header, 20 byte footer).
  
  Really quite efficient.

  My problem with this approach is you cannot duplicate an object
  in a pack stream.  So I have to be very careful when using this
  pack to complete a larger pack stream, as I need to avoid writing
  any object in the front if it will appear in the pack I am tacking
  onto the end.

  That is more common that you might expect, e.g.  consider a
  "Documentation/SubmittingPatches" file that hasn't been touched in
  ages...  it would appear in the recent commit object enumeration,
  but it also will appear in the big cache pack I want to tack on
  the end, as it hasn't changed since the pack was created.


Right now I am leaning towards caching the object enumeration
(the first approach mentioned).

-- 
Shawn.

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2009-03-09 15:29 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-08  3:22 JGit server performance Shawn O. Pearce
     [not found] ` <d411cc4a0903090801w7748d26pb821a7bfb3db660@mail.gmail.com>
2009-03-09 15:27   ` Shawn O. Pearce

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox