git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: mhagger@alum.mit.edu
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
	Drew Northup <drew.northup@maine.edu>,
	Jakub Narebski <jnareb@gmail.com>,
	Heiko Voigt <hvoigt@hvoigt.net>,
	Johan Herland <johan@herland.net>,
	Julian Phillips <julian@quantumfyre.co.uk>,
	Michael Haggerty <mhagger@alum.mit.edu>
Subject: [PATCH 00/28] Store references hierarchically in cache
Date: Fri, 28 Oct 2011 14:28:13 +0200	[thread overview]
Message-ID: <1319804921-17545-1-git-send-email-mhagger@alum.mit.edu> (raw)

From: Michael Haggerty <mhagger@alum.mit.edu>

This patch series applies on top of v3 of "Use refs API more
consistently", which ultimately branches from gitster/master.

This patch series changes how references are stored in the reference
cache data structures (entirely internal to refs.c).  Previously, the
packed refs were stored in one big array of pointers-to-struct, and
the loose refs were stored in another.  This had a few problems:

* Whenever any loose refs were needed, the whole refs tree had to be
  read from disk into memory.  This can be quite expensive when there
  are a lot of loose references.  And it is often overkill, as
  frequently only a small part of the refs tree is needed.  For
  example:

  * refs/replace is almost *always* needed even though it often
    doesn't even exist.  Thus the presence of many loose references
    slows down *many* git commands for no reason whatsoever.

  * When a new reference is created, is_refname_available() is called
    to see whether there is another another reference whose name
    conflicts with the new one.  Currently this loads and iterates
    through *all* references.  But there are only a few refnames that
    can possibly conflict; for example, given the refname
    "refs/heads/foo/bar", the only possible conflicts are with
    "refs/heads/foo" and "refs/heads/foo/bar/*".  Therefore it is
    silly to load and iterate through the whole refname hierarchy.

  * "git for-each-ref" is capable of searching a subtree of the
    references.  But currently this causes all references to be
    loaded.

Therefore, this patch series changes the data structure used to store
the reference cache from a single array of pointers-to-struct into a
tree-like structure in which each subdirectory of the reference
namespace is stored as an array of pointers-to-entry and entries can
be either references or subdirectories containing more references.
Moreover, each subdirectory of loose references is only read from disk
when a reference from that subdirectory (or one of its descendants) is
needed.  This slightly slows down commands that need to iterate
through all references (simply because the new data structures are
more complicated), but it *dramatically* decreases the time needed for
some common operations.  For example, in a test repository with 20000
revisions and 10000 loose tags:

* the time to create a new branch goes from 180 ms to less than 10 ms
  (my test resolution only includes two decimal places) and the time
  to checkout a new branch does the same.

* the time for a "git filter-branch" of all commits (which used to
  scale like N^2) goes from 4 hours to 13 minutes.  (Since
  filter-branch necessarily *creates* lots of loose references, the
  savings are also there if the references are originally packed.)

The efficiency gains are such that some operations are now faster with
loose references than with packed references; however, some operations
with packed references slow down a bit.

These changes do not increase the amount of space per reference needed
for the reference cache, but they do add one similarly-sized entry for
each subdirectory (for each of loose and packed).  I don't think that
the space increase should be significant in any reasonable situation.

After these changes, there is a benefit to sharding the reference
namespace, especially for loose references.


The patches:

The first few patches are just preparation.

Patch 6 "refs: store references hierarchically" introduces the big
data structure change.  It causes much extra sorting to be done and
therefore slows down performance significantly, but the following two
commits restore performance to approximately the old level.

Patches 11-24 change most of the internal functions to work with
"struct ref_entry *" (namely the kind of ref_entry that holds a
directory of references) instead of "struct ref_dir *".  The reason
for this change it to allow these functions access to the "flag" and
"name" fields that are stored in ref_entry and thereby avoid having to
store redundant information in "struct ref_dir" (which would increase
the size of *every* ref_entry because of its presence in the union).

The other patches (9-10 and 25-28) reap the benefits of the new data
structure:

* do_for_each_ref() can iterate only over the desired subtree instead
  of iterating over all references and filtering out the unwanted ones

* loose references can be read on demand, one directory at a time

* is_refname_available() can query only the possibly-conflicting parts
  of the reference namespace

Michael Haggerty (28):
  refs.c: reorder definitions more logically
  free_ref_entry(): new function
  check_refname_component(): return 0 for zero-length components
  struct ref_entry: nest the value part in a union
  refs.c: rename ref_array -> ref_dir
  refs: store references hierarchically
  sort_ref_dir(): do not sort if already sorted
  refs: sort ref_dirs lazily
  do_for_each_ref(): only iterate over the subtree that was requested
  get_ref_dir(): keep track of the current ref_dir
  refs: wrap top-level ref_dirs in ref_entries
  get_packed_refs(): return (ref_entry *) instead of (ref_dir *)
  get_loose_refs(): return (ref_entry *) instead of (ref_dir *)
  is_refname_available(): take (ref_entry *) instead of (ref_dir *)
  find_ref(): take (ref_entry *) instead of (ref_dir *)
  read_packed_refs(): take (ref_entry *) instead of (ref_dir *)
  add_ref(): take (ref_entry *) instead of (ref_dir *)
  find_containing_direntry(): use (ref_entry *) instead of (ref_dir *)
  search_ref_dir(): take (ref_entry *) instead of (ref_dir *)
  add_entry(): take (ref_entry *) instead of (ref_dir *)
  do_for_each_ref_in_dir*(): take (ref_entry *) instead of (ref_dir *)
  sort_ref_dir(): take (ref_entry *) instead of (ref_dir *)
  struct ref_dir: store a reference to the enclosing ref_cache
  read_loose_refs(): take a (ref_entry *) as argument
  refs: read loose references lazily
  is_refname_available(): query only possibly-conflicting references
  read_packed_refs(): keep track of the directory being worked in
  repack_without_ref(): call clear_packed_ref_cache()

 refs.c | 1268 ++++++++++++++++++++++++++++++++++++++++++---------------------
 refs.h |    7 +-
 2 files changed, 850 insertions(+), 425 deletions(-)

-- 
1.7.7

             reply	other threads:[~2011-10-28 12:29 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-10-28 12:28 mhagger [this message]
2011-10-28 12:28 ` [PATCH 01/28] refs.c: reorder definitions more logically mhagger
2011-10-28 12:28 ` [PATCH 02/28] free_ref_entry(): new function mhagger
2011-10-28 12:28 ` [PATCH 03/28] check_refname_component(): return 0 for zero-length components mhagger
2011-10-28 12:28 ` [PATCH 04/28] struct ref_entry: nest the value part in a union mhagger
2011-10-28 12:28 ` [PATCH 05/28] refs.c: rename ref_array -> ref_dir mhagger
2011-10-28 12:28 ` [PATCH 06/28] refs: store references hierarchically mhagger
2011-10-28 12:28 ` [PATCH 07/28] sort_ref_dir(): do not sort if already sorted mhagger
2011-10-28 12:28 ` [PATCH 08/28] refs: sort ref_dirs lazily mhagger
2011-10-28 12:28 ` [PATCH 09/28] do_for_each_ref(): only iterate over the subtree that was requested mhagger
2011-10-28 12:28 ` [PATCH 10/28] get_ref_dir(): keep track of the current ref_dir mhagger
2011-10-28 12:28 ` [PATCH 11/28] refs: wrap top-level ref_dirs in ref_entries mhagger
2011-10-28 12:28 ` [PATCH 12/28] get_packed_refs(): return (ref_entry *) instead of (ref_dir *) mhagger
2011-10-28 12:28 ` [PATCH 13/28] get_loose_refs(): " mhagger
2011-10-28 12:28 ` [PATCH 14/28] is_refname_available(): take " mhagger
2011-10-28 12:28 ` [PATCH 15/28] find_ref(): " mhagger
2011-10-28 12:28 ` [PATCH 16/28] read_packed_refs(): " mhagger
2011-10-28 12:28 ` [PATCH 17/28] add_ref(): " mhagger
2011-10-28 12:28 ` [PATCH 18/28] find_containing_direntry(): use " mhagger
2011-10-28 12:28 ` [PATCH 19/28] search_ref_dir(): take " mhagger
2011-10-28 12:28 ` [PATCH 20/28] add_entry(): " mhagger
2011-10-28 12:28 ` [PATCH 21/28] do_for_each_ref_in_dir*(): " mhagger
2011-10-28 12:28 ` [PATCH 22/28] sort_ref_dir(): " mhagger
2011-10-28 12:28 ` [PATCH 23/28] struct ref_dir: store a reference to the enclosing ref_cache mhagger
2011-10-28 12:28 ` [PATCH 24/28] read_loose_refs(): take a (ref_entry *) as argument mhagger
2011-10-28 12:28 ` [PATCH 25/28] refs: read loose references lazily mhagger
2011-10-28 12:28 ` [PATCH 26/28] is_refname_available(): query only possibly-conflicting references mhagger
2011-11-15  5:55   ` [PATCH] Fix "is_refname_available(): query only possibly-conflicting references" mhagger
2011-11-15  7:24     ` Junio C Hamano
2011-11-15 16:19       ` Michael Haggerty
2011-11-15 19:19         ` Junio C Hamano
2011-10-28 12:28 ` [PATCH 27/28] read_packed_refs(): keep track of the directory being worked in mhagger
2011-10-28 12:28 ` [PATCH 28/28] repack_without_ref(): call clear_packed_ref_cache() mhagger
2011-10-28 13:07 ` [PATCH 00/28] Store references hierarchically in cache Ramkumar Ramachandra
2011-10-28 18:45   ` Michael Haggerty
2011-11-16 12:51 ` [PATCH 00/28] Store references hierarchically in cache -- benchmark results Michael Haggerty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1319804921-17545-1-git-send-email-mhagger@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=drew.northup@maine.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=hvoigt@hvoigt.net \
    --cc=jnareb@gmail.com \
    --cc=johan@herland.net \
    --cc=julian@quantumfyre.co.uk \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).