From: mhagger@alum.mit.edu
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
Drew Northup <drew.northup@maine.edu>,
Jakub Narebski <jnareb@gmail.com>,
Heiko Voigt <hvoigt@hvoigt.net>,
Johan Herland <johan@herland.net>,
Julian Phillips <julian@quantumfyre.co.uk>,
Michael Haggerty <mhagger@alum.mit.edu>
Subject: [PATCH 00/28] Store references hierarchically in cache
Date: Fri, 28 Oct 2011 14:28:13 +0200 [thread overview]
Message-ID: <1319804921-17545-1-git-send-email-mhagger@alum.mit.edu> (raw)
From: Michael Haggerty <mhagger@alum.mit.edu>
This patch series applies on top of v3 of "Use refs API more
consistently", which ultimately branches from gitster/master.
This patch series changes how references are stored in the reference
cache data structures (entirely internal to refs.c). Previously, the
packed refs were stored in one big array of pointers-to-struct, and
the loose refs were stored in another. This had a few problems:
* Whenever any loose refs were needed, the whole refs tree had to be
read from disk into memory. This can be quite expensive when there
are a lot of loose references. And it is often overkill, as
frequently only a small part of the refs tree is needed. For
example:
* refs/replace is almost *always* needed even though it often
doesn't even exist. Thus the presence of many loose references
slows down *many* git commands for no reason whatsoever.
* When a new reference is created, is_refname_available() is called
to see whether there is another another reference whose name
conflicts with the new one. Currently this loads and iterates
through *all* references. But there are only a few refnames that
can possibly conflict; for example, given the refname
"refs/heads/foo/bar", the only possible conflicts are with
"refs/heads/foo" and "refs/heads/foo/bar/*". Therefore it is
silly to load and iterate through the whole refname hierarchy.
* "git for-each-ref" is capable of searching a subtree of the
references. But currently this causes all references to be
loaded.
Therefore, this patch series changes the data structure used to store
the reference cache from a single array of pointers-to-struct into a
tree-like structure in which each subdirectory of the reference
namespace is stored as an array of pointers-to-entry and entries can
be either references or subdirectories containing more references.
Moreover, each subdirectory of loose references is only read from disk
when a reference from that subdirectory (or one of its descendants) is
needed. This slightly slows down commands that need to iterate
through all references (simply because the new data structures are
more complicated), but it *dramatically* decreases the time needed for
some common operations. For example, in a test repository with 20000
revisions and 10000 loose tags:
* the time to create a new branch goes from 180 ms to less than 10 ms
(my test resolution only includes two decimal places) and the time
to checkout a new branch does the same.
* the time for a "git filter-branch" of all commits (which used to
scale like N^2) goes from 4 hours to 13 minutes. (Since
filter-branch necessarily *creates* lots of loose references, the
savings are also there if the references are originally packed.)
The efficiency gains are such that some operations are now faster with
loose references than with packed references; however, some operations
with packed references slow down a bit.
These changes do not increase the amount of space per reference needed
for the reference cache, but they do add one similarly-sized entry for
each subdirectory (for each of loose and packed). I don't think that
the space increase should be significant in any reasonable situation.
After these changes, there is a benefit to sharding the reference
namespace, especially for loose references.
The patches:
The first few patches are just preparation.
Patch 6 "refs: store references hierarchically" introduces the big
data structure change. It causes much extra sorting to be done and
therefore slows down performance significantly, but the following two
commits restore performance to approximately the old level.
Patches 11-24 change most of the internal functions to work with
"struct ref_entry *" (namely the kind of ref_entry that holds a
directory of references) instead of "struct ref_dir *". The reason
for this change it to allow these functions access to the "flag" and
"name" fields that are stored in ref_entry and thereby avoid having to
store redundant information in "struct ref_dir" (which would increase
the size of *every* ref_entry because of its presence in the union).
The other patches (9-10 and 25-28) reap the benefits of the new data
structure:
* do_for_each_ref() can iterate only over the desired subtree instead
of iterating over all references and filtering out the unwanted ones
* loose references can be read on demand, one directory at a time
* is_refname_available() can query only the possibly-conflicting parts
of the reference namespace
Michael Haggerty (28):
refs.c: reorder definitions more logically
free_ref_entry(): new function
check_refname_component(): return 0 for zero-length components
struct ref_entry: nest the value part in a union
refs.c: rename ref_array -> ref_dir
refs: store references hierarchically
sort_ref_dir(): do not sort if already sorted
refs: sort ref_dirs lazily
do_for_each_ref(): only iterate over the subtree that was requested
get_ref_dir(): keep track of the current ref_dir
refs: wrap top-level ref_dirs in ref_entries
get_packed_refs(): return (ref_entry *) instead of (ref_dir *)
get_loose_refs(): return (ref_entry *) instead of (ref_dir *)
is_refname_available(): take (ref_entry *) instead of (ref_dir *)
find_ref(): take (ref_entry *) instead of (ref_dir *)
read_packed_refs(): take (ref_entry *) instead of (ref_dir *)
add_ref(): take (ref_entry *) instead of (ref_dir *)
find_containing_direntry(): use (ref_entry *) instead of (ref_dir *)
search_ref_dir(): take (ref_entry *) instead of (ref_dir *)
add_entry(): take (ref_entry *) instead of (ref_dir *)
do_for_each_ref_in_dir*(): take (ref_entry *) instead of (ref_dir *)
sort_ref_dir(): take (ref_entry *) instead of (ref_dir *)
struct ref_dir: store a reference to the enclosing ref_cache
read_loose_refs(): take a (ref_entry *) as argument
refs: read loose references lazily
is_refname_available(): query only possibly-conflicting references
read_packed_refs(): keep track of the directory being worked in
repack_without_ref(): call clear_packed_ref_cache()
refs.c | 1268 ++++++++++++++++++++++++++++++++++++++++++---------------------
refs.h | 7 +-
2 files changed, 850 insertions(+), 425 deletions(-)
--
1.7.7
next reply other threads:[~2011-10-28 12:29 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-10-28 12:28 mhagger [this message]
2011-10-28 12:28 ` [PATCH 01/28] refs.c: reorder definitions more logically mhagger
2011-10-28 12:28 ` [PATCH 02/28] free_ref_entry(): new function mhagger
2011-10-28 12:28 ` [PATCH 03/28] check_refname_component(): return 0 for zero-length components mhagger
2011-10-28 12:28 ` [PATCH 04/28] struct ref_entry: nest the value part in a union mhagger
2011-10-28 12:28 ` [PATCH 05/28] refs.c: rename ref_array -> ref_dir mhagger
2011-10-28 12:28 ` [PATCH 06/28] refs: store references hierarchically mhagger
2011-10-28 12:28 ` [PATCH 07/28] sort_ref_dir(): do not sort if already sorted mhagger
2011-10-28 12:28 ` [PATCH 08/28] refs: sort ref_dirs lazily mhagger
2011-10-28 12:28 ` [PATCH 09/28] do_for_each_ref(): only iterate over the subtree that was requested mhagger
2011-10-28 12:28 ` [PATCH 10/28] get_ref_dir(): keep track of the current ref_dir mhagger
2011-10-28 12:28 ` [PATCH 11/28] refs: wrap top-level ref_dirs in ref_entries mhagger
2011-10-28 12:28 ` [PATCH 12/28] get_packed_refs(): return (ref_entry *) instead of (ref_dir *) mhagger
2011-10-28 12:28 ` [PATCH 13/28] get_loose_refs(): " mhagger
2011-10-28 12:28 ` [PATCH 14/28] is_refname_available(): take " mhagger
2011-10-28 12:28 ` [PATCH 15/28] find_ref(): " mhagger
2011-10-28 12:28 ` [PATCH 16/28] read_packed_refs(): " mhagger
2011-10-28 12:28 ` [PATCH 17/28] add_ref(): " mhagger
2011-10-28 12:28 ` [PATCH 18/28] find_containing_direntry(): use " mhagger
2011-10-28 12:28 ` [PATCH 19/28] search_ref_dir(): take " mhagger
2011-10-28 12:28 ` [PATCH 20/28] add_entry(): " mhagger
2011-10-28 12:28 ` [PATCH 21/28] do_for_each_ref_in_dir*(): " mhagger
2011-10-28 12:28 ` [PATCH 22/28] sort_ref_dir(): " mhagger
2011-10-28 12:28 ` [PATCH 23/28] struct ref_dir: store a reference to the enclosing ref_cache mhagger
2011-10-28 12:28 ` [PATCH 24/28] read_loose_refs(): take a (ref_entry *) as argument mhagger
2011-10-28 12:28 ` [PATCH 25/28] refs: read loose references lazily mhagger
2011-10-28 12:28 ` [PATCH 26/28] is_refname_available(): query only possibly-conflicting references mhagger
2011-11-15 5:55 ` [PATCH] Fix "is_refname_available(): query only possibly-conflicting references" mhagger
2011-11-15 7:24 ` Junio C Hamano
2011-11-15 16:19 ` Michael Haggerty
2011-11-15 19:19 ` Junio C Hamano
2011-10-28 12:28 ` [PATCH 27/28] read_packed_refs(): keep track of the directory being worked in mhagger
2011-10-28 12:28 ` [PATCH 28/28] repack_without_ref(): call clear_packed_ref_cache() mhagger
2011-10-28 13:07 ` [PATCH 00/28] Store references hierarchically in cache Ramkumar Ramachandra
2011-10-28 18:45 ` Michael Haggerty
2011-11-16 12:51 ` [PATCH 00/28] Store references hierarchically in cache -- benchmark results Michael Haggerty
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1319804921-17545-1-git-send-email-mhagger@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=drew.northup@maine.edu \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=hvoigt@hvoigt.net \
--cc=jnareb@gmail.com \
--cc=johan@herland.net \
--cc=julian@quantumfyre.co.uk \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).