[PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion
@ 2009-08-08  7:31 Nick Edelen
  2009-08-08 18:31 ` Junio C Hamano
  2009-08-16 22:41 ` Nick Edelen
  0 siblings, 2 replies; 5+ messages in thread
From: Nick Edelen @ 2009-08-08  7:31 UTC (permalink / raw)
  To: Junio C Hamano, Nicolas Pitre, Johannes Schindelin, Sam Vilain,
	Michael J Gruber

Before any code is introduced the full documentation is put forth.  This 
provides a man page for the porcelain, and a technical doc in technical/.  The 
latter describes the API, and discusses rev-cache's design, file format and 
mechanics.

Signed-off-by: Nick Edelen <sirnot@gmail.com>

---
Same as 1/5 in previous set.  The man page has been flushed out, and formatted 
more congruently with git's other man pages.  The technical discussion has been 
updated and tweaked for better formatting as well.  This has been in response 
to Sam's commentary.

 Documentation/git-rev-cache.txt       |   96 +++++++++
 Documentation/technical/rev-cache.txt |  379 +++++++++++++++++++++++++++++++++
 2 files changed, 475 insertions(+), 0 deletions(-)

diff --git a/Documentation/git-rev-cache.txt b/Documentation/git-rev-cache.txt
new file mode 100755
index 0000000..600cf64
--- /dev/null
+++ b/Documentation/git-rev-cache.txt
@@ -0,0 +1,96 @@
+git-rev-cache(1)
+================
+
+NAME
+----
+git-rev-cache - Add, walk and maintain revision cache slices.
+
+SYNOPSIS
+--------
+'git-rev-cache' COMMAND [options] [<commit>...]
+
+DESCRIPTION
+-----------
+A front end for the rev-cache API.  It is mainly intended for cache slice
+generation and maintenance, but can also walk commits within a slice.  It
+currently provides basic administrative functionality.
+
+COMMANDS
+--------
+
+add
+~~~
+Add revisions to the cache by creating a new cache slice.  Reads a revision
+list from the command line, formatted as: `START START ... \--not END END ...`
+
+Options:
+
+\--all::
+	Include all branch heads in the new cache slice.
+
+\--fresh::
+	Exclude everything already in the revision cache.  
+
+\--stdin::
+	Read newline-seperated revisions from the standard input.  Use \--not
+	to exclude commits, as on the command line.
+
+\--legs::
+	Ensure newly-generated cache slice is self-contained.  This means that
+	no commit has partially cached parents, in that all its parents are
+	cached or none of them are.
+
+\--noobjects::
+	Non-commit objects are normally included along with the commit with
+	which they were introduced.  This is obviously very benificial, but can
+	take longer in cache slice generation.  Use this option to disable
+	non-commit object caching.
+
+walk
+~~~~
+Walk a cache slice based on revision list, formatted like 'add'.
+
+Options:
+
+\--objects::
+	Like 'git-rev-list', 'walk' will normally only list commits.  Use this
+	option to list non-commit objects as well.
+
+fuse
+~~~~
+Merge several cache slices into a single large slice.  On each invocation of
+'add' a new file ("slice") is added to the revision cache directory, and after
+several additions the directory may become populated with many, relatively
+small slices.  Numerous smaller slices will yield poorer performance than a one
+or two large ones, because of the overhead of loading new slices into memory.
+
+Running 'fuse' every once in a while will solve this problem by coalescing all
+the cache slices into one larger slice.  For very large projects, using
+\--ignore-size is advisable to prevent overly large cache slices.
+
+Options:
+
+\--all::
+	Normally fuse will only include everything that's already in the
+	revision cache.  \--add tells it to start walking from the branch
+	heads, effectively an `add --all --fresh; fuse` (pseudo-command).
+
+\--noobjects::
+	As in 'add', this option disables inclusion of non-commit objects.  If
+	some cache slices do contain such objects, the information will be lost.
+
+\--ignore-size[=N]::
+	Do not merge cache slices of size >=N.  If N is not provided 'fuse'
+	will default to a size of ~25MB (be aware that slices must be mapped to
+	memory).
+
+index
+~~~~~
+Regenerate the cache index.  If the index file associating objects with cache
+slices gets corrupted, lost, or otherwise becomes unusable, 'index' will
+quickly regenerate the file.  'fuse' will also regenerate the index.
+
+DISCUSSION
+----------
+For an explanation of the API and its inner workings, see
+link:technical/rev-cache.txt[technical info on rev-cache].
diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt
new file mode 100755
index 0000000..78d7218
--- /dev/null
+++ b/Documentation/technical/rev-cache.txt
@@ -0,0 +1,379 @@
+rev-cache
+=========
+
+The revision cache API ('rev-cache') provides a method for efficiently storing
+and accessing commit branch sections.  Such branch slices are defined by a
+series of top (interesting) and bottom (uninteresting) commits.  Each slice
+contains, per commit:
+
+* All intra-slice topological relations, encoded into path "channels" (see
+  'Mechanics' for full explanation).
+* Object meta-data: type, SHA-1, size, date (for commits).
+* Objects introduced in that commit, relative to slice (ie. only for non-start
+  commits).
+
+Storage data structures are not exported, in part to keep git's global scope
+clean, but largely because they're pretty much useless outside of rev-cache.
+
+The API
+-------
+
+The API for 'rev-cache' is quite simple.  You can find the function prototypes
+in `revision.h`.
+
+Data Structures
+~~~~~~~~~~~~~~~
+
+The `rev_cache_info` struct holds all the options and flags for the API.
+
+----
+struct rev_cache_info {
+        /* generation flags */
+        unsigned objects : 1,
+                legs : 1,
+                make_index : 1;
+        
+        /* traversal flags */
+        unsigned save_unique : 1,
+                add_to_pending : 1,
+                add_names : 1;
+        
+        /* fuse options */
+        unsigned int ignore_size;
+};
+----
+
+The fields:
+
+`objects`::
+	Add non-commit objects to slice.
+
+`legs`::
+	Ensure end commits have no children.
+
+`make_index`::
+	Integrate newly-made slice into index.
+
+`save_unique`::
+	Load non-commit objects into `unique` field of their associated
+	`commit` object.
+
+`add_to_pending`::
+	Append unique non-commit objects to the `pending` object list in the
+	passed `rev_info` instance.
+
+`add_names`::
+	Include non-commit object names in the pending object entries (if
+	`add_to_pending` is set), and set the `name` field of `blob` and `tree`
+	objects (if `save_unique`).
+
+`ignore_size`::
+	If non-zero, ignore slices with size greater or equal to this during
+fusion.
+
+Functions
+~~~~~~~~~
+
+`init_rci`::
+        Initiate a `rev_cache_info` struct to default options.
+
+`make_cache_slice`::
+        Create a cache based on an a `rev_info` instance or `commit_list` s of
+        "starts" and "ends" (defaulting to latter if `rev_info` pointer is
+        NULL), copying the cache SHA-1 into a passed pointer if non-zero.  A
+        `rev_cache_info` struct pointer can be passed to set options, while
+        passing NULL will set default options.  A last parameter can
+        optionally recieve the final cache hash.
+
+`make_cache_index`::
+        Add a slice to the cache index.  Requires a file descriptor, the cache
+        hash and the file size.  Note that this is normally called by
+        `make_cache_slice` under the `make_index` option.
+
+`get_cache_slice`::
+        Given a commit SHA-1 `get_cache_slice` will search the slice index and
+        return, if found, the cache-identifying SHA-1.
+
+`traverse_cache_slice`::
+        Traverse a specified cache slice based on:
+
+        * `rev_info` instance; options set in the `rev_cache_info` field.
+        * cache SHA-1 identifier
+        * a starting commit and commit work list
+        * date of oldest encountered interesting commit (passing 0 will tell
+          rev-cache to use defaults)
+        * current `slop` (this is a counter, a la `revision.c`, to determine
+          when to stop traversing; typically you can just pass 0 and rev-cache
+          will use default settings)
+        
++
+The output is sent to a FILO `commit_list` "queue", while any end commits 
+are passed back into the work list.  If the walk is not contained within the 
+slice, commit boundaries are also inserted into "work".
+
+`start_from_slices`::
+        Will mark all start-commits in the specified cache slices with a given
+        flag, and add them to the rev pending list.  Will include all if no
+        slices are specified.
+
+`coagulate_cache_slices`::
+        Generate a slice based on the passed `rev_info` instance, replacing all
+        encountered slices with one (larger) slice.  The `ignore_size` field in
+        `rci`, if non-zero, will dictate which cache slice sizes to ignore in
+        both traversal and replacement.
+
+`regenerate_index`::
+        Remake cache index.
+
+Example Usage
+-------------
+
+A few examples to demonstrate usage:
+
+.Creating a slice
+----
+/* pretend you're a porcelain for rev-cache reading from the command line */
+struct rev_info revs;
+struct rev_cache_info rci;
+
+init_revisions(&revs, 0);
+init_rci(&rci);
+
+flags = 0;
+for (i = 1; i < argc; i++) {
+        if (!strcmp(argv[i], "--not"))
+                flags ^= UNINTERESTING;
+        else if(!strcmp(argv[i], "--fresh"))
+                starts_from_slices(&revs, UNINTERESTING, 0, 0);
+        else
+                handle_revision_arg(argv[i], &revs, flags, 1);
+}
+
+/* we want to explicitly set certain options */
+rci.objects = 0;
+
+if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1))
+        printf("made slice!  it's called %s\n", sha1_to_hex(cache_sha1));
+----
+
+.Traversing a slice
+----
+/* let's say you're walking the tree with a 'work' list of current heads and a 
+ * FILO output list 'out' */
+out = 0;
+outp = &out;
+
+while (work) {
+        struct commit *commit = pop_commit(&work);
+        struct object *object = &commit->object;
+        unsigned char *cache_sha1;
+        
+        if (cache_sha1 = get_cache_slice(object->sha1)) {
+                /* note that this will instatiate any topo-relations 
+                 * as it goes */
+                if (traverse_cache_slice(&revs, cache_sha1,
+                        commit, 0, 0, /* use defaults */
+                        &outp, &work) < 0)
+                        die("I'm overreacting to a non-fatal cache error");
+        } else {
+                struct commit_list *parents = commit->parents;
+                
+                while (parents) {
+                        struct commit *p = parents->item;
+                        struct object *po = &p->object;
+                        
+                        parents = parents->next;
+                        if (po->flags & UNINTERESTING)
+                                continue;
+                        
+                        if (object->flags & UNINTERESTING)
+                                po->flags |= UNINTERESTING;
+                        else if (po->flags & SEEN)
+                                continue;
+                        
+                        if (!po->parsed)
+                                parse_commit(p);
+                        insert_by_date(p, &work);
+                }
+                
+                if (object->flags & (SEEN | UNINTERESTING) == 0)
+                        outp = &commit_list_insert(commit, outp);
+                object->flags |= SEEN;
+        }
+}
+----
+
+Some Internals
+--------------
+
+Although you really don't need to know anything about how rev-cache actually
+does its magic shizz, a bit of background may go a long way if you're wading
+through the source.
+
+File Formats
+~~~~~~~~~~~~
+
+A slice has a basic fixed-size header, followed by a certain number of object
+entries, then a NULL-seperated list of object names.  Commits are sorted in
+topo-order, and each commit entry is followed by the objects added in that
+commit.
+
+----
+         -- +--------------------------------+
+header      | object number, etc...          |
+         -- +--------------------------------+
+commit      | commit info                    |
+entry       | path data                      |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | ...                            |
+         -- +--------------------------------+
+commit      | commit info                    |
+entry       | path data                      |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | ...                            |
+         -- +--------------------------------+
+...         ...                               
+         -- +--------------------------------+
+name list   | \0some_file_name\0             |
+(note       +--------------------------------+
+preceeding  | another_file\0                 |
+null)       ...                              |
+            +--------------------------------+
+----
+
+The index is a single file that associates SHA-1s with cache slices and file
+positions.  It is somewhat similar to pack-file indexes, containing a fanout
+table and a list of index entries sorted by hash.
+
+----
+         -- +--------------------------------+
+header      | object #, cache #, etc.        |
+         -- +--------------------------------+
+sha1s of    | SHA-1                          |
+slices      | ...                            |
+         -- +--------------------------------+
+fanout      | fanout[0x00]                   |
+table       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+            | fanout[0xff]                   |
+         -- +--------------------------------+
+index       | SHA-1 of object                |
+entries     | index of cache slice SHA-1     |
+            | position in cache slice        |
+            +--------------------------------+
+            |                                |
+            ...                               
+            +--------------------------------+
+----
+
+All the relavant structures are readily accessible in `rev-cache.c`
+
+Mechanics
+~~~~~~~~~
+
+The most important part of rev-cache is its method of encoding topological
+relations.  To ensure fluid traversal and reconstruction, commits are related
+through high-level "streams"/"channels" rather than individual
+interconnections.  Intuitively, rev-cache stores history the way gitk shows it:
+commits strung up on lines, which interconnect at merges and branches.
+
+Each commit is associated to a given channel/path via a 'path id', and
+variable-length fields govern which paths (if any) are closed or opened at that
+object.  This means that topo-data can be preserved in only a few bytes extra
+per object entry.  Other information stored per entry is the sha-1 hash, type,
+date, size, name, and status in cache slice.  Here is format of an object
+entry, both on-disk and in-memory:
+
+----
+struct object_entry {
+        unsigned type : 3;
+        unsigned is_end : 1;
+        unsigned is_start : 1;
+        unsigned uninteresting : 1;
+        unsigned include : 1;
+        unsigned flags : 1;
+        unsigned char sha1[20];
+        
+        unsigned merge_nr : 6;
+        unsigned split_nr : 7;
+        unsigned size_size : 3;
+        unsigned name_size : 3;
+        unsigned what_to_do : 5;
+        
+        uint32_t date;
+        uint16_t path;
+        
+        /* merge paths */
+        /* split paths */
+        /* size */
+        /* name index */
+};
+----
+
+An explanation of each field:
+
+`type`::
+	Object type
+`is_end`::
+	The commit has some parents outside the cache slice (all if slice has
+	legs)
+`is_start`::
+	The commit has no children in cache slice
+`uninteresting`::
+	Run-time flag, used in traversal
+`include`::
+	Run-time flag, used in traversal (initialization)
+`flags`::
+	Currently unused, extra bit
+`sha`::
+	Object SHA-1 hash
+
+`merge_nr`::
+	The number of paths the current channel diverges into; the current path
+	ends upon any merge.
+`split_nr`::
+	The number of paths this commit ends; used on both merging and
+	branching.
+`size_size`::
+	Number of bytes the object size takes up.
+`name_size`::
+	Number of bytes the name index takes up.
+`what_to_do`::
+	Currently space-filler, to properly align the struct fields.  Can
+	potentially be used for anything.
+
+merge paths::
+	The path IDs (16-bit) that are to be created.
+split paths::
+	The path IDs (16-bit) that are to be ended.
+size::
+	The size split into the minimum number of bytes.
+name index::
+	An offset for the null-seperated, object name list at the end of the
+	cache slice.  Also split into the minimum number of bytes.
+
+Each path ID refers to an index in a 'path array', which stores the current
+status (eg. active, interestingness) of each channel.
+
+Due to topo-relations and boundary tracking, all of a commit's parents must be
+encountered before the path is reallocated.  This is achieved by using a
+counter system per merge: starting at the parent number, the counter is
+decremented as each parent is encountered (dictated by 'split paths'); at 0 the
+path is cleared.
+
+Boundary tracking is necessary because non-commits are stored relative to the
+commit in which they were introduced.  If a series of commits is not included
+in the output, the last interesting commit must be parsed manually to ensure
+all objects are accounted for.
+
+To prevent list-objects from recursing into trees that we've already taken care
+of, the flag `FACE_VALUE` is introduced.  An object with this flag is not
+explored (= "taken at face value"), significantly reducing I/O and processing
+time.
+
+(NSE)
-- 
tg: (6f604e7..) t/revcache/docs (depends on: t/revcache/integration)

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion
  2009-08-08  7:31 [PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion Nick Edelen
@ 2009-08-08 18:31 ` Junio C Hamano
  2009-08-09 14:01   ` Nick Edelen
  2009-08-16 22:41 ` Nick Edelen
  1 sibling, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2009-08-08 18:31 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

"Nick Edelen" <sirnot@gmail.com> writes:

> Before any code is introduced the full documentation is put forth.  This 
> provides a man page for the porcelain, and a technical doc in technical/.  The 
> latter describes the API, and discusses rev-cache's design, file format and 
> mechanics.
>
> Signed-off-by: Nick Edelen <sirnot@gmail.com>

Quite respectable work, in that not many people try to start from
specification. 

I'll ask quite a many questions in the following.  I do not want you to
explain them to _me_ in your response.  I am asking them so that the v3
and subsequent revision will remove the need for these questions to be
asked by somebody (other than me) who reads the document.

>  Documentation/git-rev-cache.txt       |   96 +++++++++
>  Documentation/technical/rev-cache.txt |  379 +++++++++++++++++++++++++++++++++
>  2 files changed, 475 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/git-rev-cache.txt b/Documentation/git-rev-cache.txt
> new file mode 100755

An executable documentation?

> index 0000000..600cf64
> --- /dev/null
> +++ b/Documentation/git-rev-cache.txt
> @@ -0,0 +1,96 @@
> +git-rev-cache(1)
> +================
> +
> +NAME
> +----
> +git-rev-cache - Add, walk and maintain revision cache slices.

Drop the last '.'; also you may need to add this command to the command
classification list so that "man git" would know where to include this in
its command list.

> +SYNOPSIS
> +--------
> +'git-rev-cache' COMMAND [options] [<commit>...]
> +
> +DESCRIPTION
> +-----------
> +A front end for the rev-cache API.  It is mainly intended for cache slice
> +generation and maintenance, but can also walk commits within a slice.  It
> +currently provides basic administrative functionality.
> +

Briefly discuss

 - what rev-cache is for (e.g. "an optional mechanism to speed up
   revision traversal");

 - what a cache slice is (e.g. "records such and such for a set of commit
   objects");

 - how these two relate to each other (e.g. "one or more slices are tied
   together with a single rev-cache-index to form a rev-cache); and

 - how a rev-cache relates to a repository or an object store (e.g. "a
   given repository can have up to N rev-cache that describes its
   objects")

here.

> +COMMANDS
> +--------
> +
> +add
> +~~~
> +Add revisions to the cache by creating a new cache slice.  Reads a revision
> +list from the command line, formatted as: `START START ... \--not END END ...`
> +
> +Options:
> +
> +\--all::
> +	Include all branch heads in the new cache slice.

Is this "all branch heads"?  Shouldn't it be "all refs?"  If not, why not?

> +\--fresh::
> +	Exclude everything already in the revision cache.  

This concept seems to correspond to --incremental in pack-objects parlance.

> +\--stdin::
> +	Read newline-seperated revisions from the standard input.  Use \--not
> +	to exclude commits, as on the command line.
> +
> +\--legs::
> +	Ensure newly-generated cache slice is self-contained.  This means that
> +	no commit has partially cached parents, in that all its parents are
> +	cached or none of them are.

If I am reading the description correctly what you mean is that for a
history:

	---A---C---D
              /
          ---B

a slice is allowed to consist of (empty), (D), (C D), (A B C D), but not
(A C D), if it wants to be "--legs compliant".

But without your definition in the second and third line, people would not
call (C D) "self-contained", as you cannot tell what C's parents are.
Because you do not use "self-containedness" in the rest of documentation
to explain other things, you do not need to introduce such an unintuitive
definition of a word.  I'd suggest dropping the first line.

Also for the lack of better word I said "--legs compliant" above.  We do
need a better word to express this concept, and use that as the option
name.  Unfortunately, I can tell "legs" is not it, but I cannot offhand
think of what word should it be.

> +\--noobjects::
> +	Non-commit objects are normally included along with the commit with
> +	which they were introduced.  This is obviously very benificial, but can
> +	take longer in cache slice generation.  Use this option to disable
> +	non-commit object caching.

"--no-objects" (I think you meant reverse of --objects to rev-list).

What is the ramification of using this option?  "rev-list" and "gitk" will
still be fast but "pack-objects" won't get as much speed-up?  Will use of
the option affect correctness?  Can a slice generated with this option
later fused with another generated without this option?  What are the
performance (or correctness) characteristics of a resulting slice of such
a fuse?

If the target audience of this manual page is Porcelain writers and API
users, these things need to be explained somewhere in this documentation.

> +walk
> +~~~~
> +Walk a cache slice based on revision list, formatted like 'add'.

What is formatted like 'add'?  The output from the command?  The command
line arguments this subcommand takes?

Define "walk".  Is it "dump the contents of a slice"?  In a rev-cache with
more than one slice, which slice is chosen to be walked, and how?  How
does the command line argument and options interract with cached data?

For example...

> +Options:
> +
> +\--objects::
> +	Like 'git-rev-list', 'walk' will normally only list commits.  Use this
> +	option to list non-commit objects as well.

... how does walking with --objects option behave if you created all of
your slices with --no-objects?  Does it notice the lack of cached object
information and falls back on reading from the objects?  If so how is this
different from rev-list itself?  If not, when is the walk command useful?
"while debugging the rev-cache machinery" is a perfectly acceptable answer
to the last question, but if that is the case please say so in the
documentation.

> +fuse
> +~~~~
> +Merge several cache slices into a single large slice.  On each invocation of
> +'add' a new file ("slice") is added to the revision cache directory, and after
> +several additions the directory may become populated with many, relatively
> +small slices.  Numerous smaller slices will yield poorer performance than a one
> +or two large ones, because of the overhead of loading new slices into memory.

Something like this is a perfect overview description that should have
been given upfront (remember, I earlier said "Briefly discuss..."?).

This operation feels like "repack".

> +Running 'fuse' every once in a while will solve this problem by coalescing all
> +the cache slices into one larger slice.  For very large projects, using
> +\--ignore-size is advisable to prevent overly large cache slices.
> +
> +Options:
> +
> +\--all::
> +	Normally fuse will only include everything that's already in the
> +	revision cache.  \--add tells it to start walking from the branch
> +	heads, effectively an `add --all --fresh; fuse` (pseudo-command).
> +
> +\--noobjects::
> +	As in 'add', this option disables inclusion of non-commit objects.  If
> +	some cache slices do contain such objects, the information will be lost.
> +
> +\--ignore-size[=N]::
> +	Do not merge cache slices of size >=N.  If N is not provided 'fuse'
> +	will default to a size of ~25MB (be aware that slices must be mapped to
> +	memory).

Should this autotune, perhaps relative to pack.windowMemory?  --ignore
sounds misnomer; it is similar to "keep" flag that packfiles have.

> +index
> +~~~~~
> +Regenerate the cache index.  If the index file associating objects with cache
> +slices gets corrupted, lost, or otherwise becomes unusable, 'index' will
> +quickly regenerate the file.  'fuse' will also regenerate the index.

We already have .git/index (the index), and .git/objects/pack/pack-*.idx
(pack idx files).  At least please call this "the rev-cache index" so that
technical documentations can differentiate when they want to talk about
this new index without having to worry about disambiguities.

At the beginning of your DESCRIPTION, you mentioned "rev-cache", then
"cache slice", without explaining what they are, or how they relate to
each other, leaving frustration to the readers.  The reader who reads on,
suppressing this frustration, is finally rewarded at the beginning of
"fuse" description, where you mention that a rev-cache contains one or
more cache slices.  At that point, how these two concepts relate to each
other is finally revealed.

But here, you introduce a new frustration.  There is no meantion in how
the rev-cache index relates to the other two concepts introduced earlier
(the rev-cache itself, and cache slices).

When is this "reindex" operation necessary under normal circumstances?  Is
it primarily to recover bugs in rev-cache?  For example, "repack -a -f" is
not about recovering from bugs in "repack" but is a way to squeeze out
more redundancy from your pack because repeated "repack -a" may degrade
the pack efficiency over time.  Is "reindex" useful for a similar reason,
even when the rev-cache mechanism is bug-free?

> diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt
> new file mode 100755

Executable documentation.

> index 0000000..78d7218
> --- /dev/null
> +++ b/Documentation/technical/rev-cache.txt
> @@ -0,0 +1,379 @@
> +rev-cache
> +=========
> +
> +The revision cache API ('rev-cache') provides a method for efficiently storing
> +and accessing commit branch sections.  Such branch slices are defined by a
> +series of top (interesting) and bottom (uninteresting) commits.  Each slice
> +contains, per commit:
> +
> +* All intra-slice topological relations, encoded into path "channels" (see
> +  'Mechanics' for full explanation).
> +* Object meta-data: type, SHA-1, size, date (for commits).
> +* Objects introduced in that commit, relative to slice (ie. only for non-start
> +  commits).

"introduced"?  That is a bit puzzling description to me, probably because
of an untold assumption.

Do you mean, in a two-commit history "B builds on top of root A", your
slice lists objects by traversing topology from the top, so commit A is
listed, and the tree and its subtrees and blobs are "introduced to the
slice", and then commit B is listed, along with its tree and associated
objects that have not been listed in the slice before commit B (namely,
objects commit B changed since commit A) are "introduced to the slice (by
commit B)"?

You can alley the above puzzlement by saying something like:

	Each slice stores information on commits and its associated
	objects, in the topological order traversing from the top
	commits.  For each commit, the following is recorded:

> +Storage data structures are not exported, in part to keep git's global scope
> +clean, but largely because they're pretty much useless outside of rev-cache.
> +
> +The API
> +-------
> +
> +The API for 'rev-cache' is quite simple.  You can find the function prototypes
> +in `revision.h`.

Drop the first sentence and leave the judging of simplicity to the reader.

> +Data Structures
> +~~~~~~~~~~~~~~~
> +
> +The `rev_cache_info` struct holds all the options and flags for the API.
> +
> +----
> +struct rev_cache_info {
> +        /* generation flags */
> +        unsigned objects : 1,
> +                legs : 1,
> +                make_index : 1;
> +        
> +        /* traversal flags */
> +        unsigned save_unique : 1,
> +                add_to_pending : 1,
> +                add_names : 1;
> +        
> +        /* fuse options */
> +        unsigned int ignore_size;
> +};
> +----
> +
> +The fields:
> +
> +`objects`::
> +	Add non-commit objects to slice.
> +
> +`legs`::
> +	Ensure end commits have no children.

What are "end commits"?

> +`make_index`::
> +	Integrate newly-made slice into index.
> +
> +`save_unique`::
> +	Load non-commit objects into `unique` field of their associated
> +	`commit` object.

What is "unique" field?  What is it used for?  Is it a list that can hold
many objects?

> +`add_to_pending`::
> +	Append unique non-commit objects to the `pending` object list in the
> +	passed `rev_info` instance.
> +
> +`add_names`::
> +	Include non-commit object names in the pending object entries (if
> +	`add_to_pending` is set), and set the `name` field of `blob` and `tree`
> +	objects (if `save_unique`).
> +
> +`ignore_size`::
> +	If non-zero, ignore slices with size greater or equal to this during
> +fusion.
> +
> +Functions
> +~~~~~~~~~
> +
> +`init_rci`::
> +        Initiate a `rev_cache_info` struct to default options.

"Initialize"?

Spell it out like "init_rev_cache_info()".  It's not like you will be
calling this over and over again, and "i" is ambiguous between rev cache
info and rev cache index.

> +`make_cache_slice`::
> +        Create a cache based on an a `rev_info` instance or `commit_list` s of
> +        "starts" and "ends" (defaulting to latter if `rev_info` pointer is
> +        NULL), copying the cache SHA-1 into a passed pointer if non-zero.  A

What's "cache SHA-1"?  I imagine that rev-cache consists of one or more
cache slices, and each slice is named after the hash value of its data,
but the documentation should not make _me_ guess.

Also a pointer is either NULL or non-NULL; it is never "zero".

> +        `rev_cache_info` struct pointer can be passed to set options, while
> +        passing NULL will set default options.  A last parameter can
> +        optionally recieve the final cache hash.

Is it just me who wish to see list of parameters on the head line, e.g.

	make_cache_slice(struct rev_info *, ...)::

to understand what the description is talking about better?

> +`make_cache_index`::
> +        Add a slice to the cache index.  Requires a file descriptor, the cache
> +        hash and the file size.  Note that this is normally called by
> +        `make_cache_slice` under the `make_index` option.

A file descriptor points at what file?  The index is written to that
descriptor?  The slice is read from the descriptor?  What is "cache hash"?
File size of what file is taken as parameter?

> +`get_cache_slice`::
> +        Given a commit SHA-1 `get_cache_slice` will search the slice index and
> +        return, if found, the cache-identifying SHA-1.

Is there a "slice index"?  How does it relate to "the cache index" the
earlier documentation mentioned in its description of the "index"
subcommand?

> +`traverse_cache_slice`::
> +        Traverse a specified cache slice based on:
> +
> +        * `rev_info` instance; options set in the `rev_cache_info` field.
> +        * cache SHA-1 identifier
> +        * a starting commit and commit work list
> +        * date of oldest encountered interesting commit (passing 0 will tell
> +          rev-cache to use defaults)
> +        * current `slop` (this is a counter, a la `revision.c`, to determine
> +          when to stop traversing; typically you can just pass 0 and rev-cache
> +          will use default settings)
> +        
> ++
> +The output is sent to a FILO `commit_list` "queue", while any end commits 
> +are passed back into the work list.  If the walk is not contained within the 
> +slice, commit boundaries are also inserted into "work".
> +
> +`start_from_slices`::
> +        Will mark all start-commits in the specified cache slices with a given
> +        flag, and add them to the rev pending list.  Will include all if no
> +        slices are specified.
> +
> +`coagulate_cache_slices`::
> +        Generate a slice based on the passed `rev_info` instance, replacing all
> +        encountered slices with one (larger) slice.  The `ignore_size` field in
> +        `rci`, if non-zero, will dictate which cache slice sizes to ignore in
> +        both traversal and replacement.

"coagulate"?  If you want to say "fuse", use that word consistently.

> +Some Internals
> +--------------
> +
> +Although you really don't need to know anything about how rev-cache actually
> +does its magic shizz, a bit of background may go a long way if you're wading
> +through the source.

Drop these three lines.  People know that already when they are reading
this section.

> +File Formats
> +~~~~~~~~~~~~
> +
> +A slice has a basic fixed-size header, followed by a certain number of object
> +entries, then a NULL-seperated list of object names.  Commits are sorted in
> +topo-order, and each commit entry is followed by the objects added in that
> +commit.

What's in the header?  Is there a version number?

> +struct object_entry {

Rename this to something more rev-cache specific, perhaps rev_cache_entry,
as the name is already used for different purposes in pack-objects and
fast-import.

> +        unsigned type : 3;
> +        unsigned is_end : 1;
> +        unsigned is_start : 1;
> +        unsigned uninteresting : 1;
> +        unsigned include : 1;
> +        unsigned flags : 1;
> +        unsigned char sha1[20];
> +        
> +        unsigned merge_nr : 6;
> +        unsigned split_nr : 7;
> +        unsigned size_size : 3;
> +        unsigned name_size : 3;
> +        unsigned what_to_do : 5;
> +        
> +        uint32_t date;
> +        uint16_t path;
> +        
> +        /* merge paths */
> +        /* split paths */
> +        /* size */
> +        /* name index */
> +};
> +----
> +
> +An explanation of each field:
> +
> +`type`::
> +	Object type
> +`is_end`::
> +	The commit has some parents outside the cache slice (all if slice has
> +	legs)
> +`is_start`::
> +	The commit has no children in cache slice

You said "top" and "bottom" in the documentation.  Perhaps use them
here as well for consistency?

> +`uninteresting`::
> +	Run-time flag, used in traversal
> +`include`::
> +	Run-time flag, used in traversal (initialization)
> +`flags`::
> +	Currently unused, extra bit
> +`sha`::
> +	Object SHA-1 hash

That's `sha1` if I am reading the above display correctly.

> +`merge_nr`::
> +	The number of paths the current channel diverges into; the current path
> +	ends upon any merge.
> +`split_nr`::
> +	The number of paths this commit ends; used on both merging and
> +	branching.

How will an overflowing situation handled?

> +`size_size`::
> +	Number of bytes the object size takes up.
> +`name_size`::
> +	Number of bytes the name index takes up.
> +`what_to_do`::
> +	Currently space-filler, to properly align the struct fields.  Can
> +	potentially be used for anything.

You seem to be assuming that 24-bit fields starting at merge_nr (begins at
offset 21) will be packed and it is Ok that after this filler field date
is aligned to multiple of 4.  I am not sure if that is a safe assumption.

We use 'unsigned long' for timestamps.  Is uint32_t sufficient?

What is `path` (not "merge paths" nor "split paths" you explain next)?
Is uint32_t sufficient?

> +merge paths::
> +	The path IDs (16-bit) that are to be created.
> +split paths::
> +	The path IDs (16-bit) that are to be ended.

Are 65536 paths sufficient?  How will an overflowing situation handled?

> +size::
> +	The size split into the minimum number of bytes.
> +name index::
> +	An offset for the null-seperated, object name list at the end of the
> +	cache slice.  Also split into the minimum number of bytes.

What does this mean?  1-to-8 bytes integer in network byte order?

Did you consider using the variable-length size representation used by the
pack object layer (cf. encode_header() in builtin-pack-objects.c and
unpack_object_header_buffer() in sha1_file.c)?

> +
> +Each path ID refers to an index in a 'path array', which stores the current
> +status (eg. active, interestingness) of each channel.
> +
> +Due to topo-relations and boundary tracking, all of a commit's parents must be
> +encountered before the path is reallocated.  This is achieved by using a
> +counter system per merge: starting at the parent number, the counter is
> +decremented as each parent is encountered (dictated by 'split paths'); at 0 the
> +path is cleared.
> +
> +Boundary tracking is necessary because non-commits are stored relative to the
> +commit in which they were introduced.  If a series of commits is not included
> +in the output, the last interesting commit must be parsed manually to ensure
> +all objects are accounted for.
> +
> +To prevent list-objects from recursing into trees that we've already taken care
> +of, the flag `FACE_VALUE` is introduced.  An object with this flag is not
> +explored (= "taken at face value"), significantly reducing I/O and processing
> +time.

> +(NSE)

You do not need this last line.  As 530e741 (Start preparing the API
documents., 2007-11-24) mention, these were pointers to people who may
know enough information to fill in _undocumented_ territory.  You
documented this API fairly completely.

Besides, nobody would know who NSE is ;-).  Back then, there were only
handful of people, all of whom can be ientified by short names, who were
responsible for large body of internal API code that needed further
documentation.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/6 (v2)] revision caching documentation: man page and  technical discussion
  2009-08-08 18:31 ` Junio C Hamano
@ 2009-08-09 14:01   ` Nick Edelen
  2009-08-09 18:16     ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Edelen @ 2009-08-09 14:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Wow, thank you very much for your commentary!

Unfortunately I don't have a stable internet connection this week
(although I can get on at minimum once a day), but I'll rework the
docs today and upload a new patch for them tomorrow or the next day.
I'll also fix the file permissions ;-)

As you said, I'll answer most of your questions in the revised docs,
but I have one question about the time field size.  git uses unsigned
long internally for its commit date, so it'd be limited by the 32bits
a long is on 32bit archs.  So, wouldn't the commit times effectively
be a uin32_t?  I mean if not I can change it; I just figured I'd save
some space.

Other than that there shouldn't be an overflow problem, unless
someone's merging over 63 branches in one go.  Maybe I'll expand that
a bit just in case though...

Thanks again,
 - Nick

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/6 (v2)] revision caching documentation: man page and  technical discussion
  2009-08-09 14:01   ` Nick Edelen
@ 2009-08-09 18:16     ` Junio C Hamano
  0 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2009-08-09 18:16 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Michael J Gruber,
	Jeff King, Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Nick Edelen <sirnot@gmail.com> writes:

> As you said, I'll answer most of your questions in the revised docs,
> but I have one question about the time field size.  git uses unsigned
> long internally for its commit date, so it'd be limited by the 32bits
> a long is on 32bit archs.  So, wouldn't the commit times effectively
> be a uin32_t?

The current implementations may be either 32 or 64 bit, but at the on-disk
permanent data structure level they are decimal integers of unlimited
size.  As the rev-cache, similarly to the index, is a local "cache" whose
nature is transient (iow we can afford to say "this new version of git
updates the file format for the rev-cache; your existing rev-cache will be
converted automatically"), its data representation can have limitation
tied to the implementation.  But the limitation should be documented.

As to defining the on-disk data format using C structure, I'd strongly
suggest looking at, learning from and mimicking how read-cache.c and
cache.h handles struct ondisk_cache_entry.  I find writing out a structure
with bitfields as-is, like your patch does, quite iffy.

> Other than that there shouldn't be an overflow problem, unless
> someone's merging over 63 branches in one go.

The current Porcelains may not allow it, but at the data structure level
commit can have arbitrary number of parents, so that assumption is already
broken at the design level.  You need an escape hatch there.

"Because such a merge is so rare, walking the rev-cache that contains such
a commit gives a wrong result without any diagnosis to the caller" is an
unacceptable escape hatch.  It is perfectly fine if the escape hatch is
"because it is so rare, rev-cache marks commits involved in such a path,
and walking the path falls back to actually reading the objects, without
any performance benefit from the cache".  It probably is also Ok if the
escape hatch is "because it is beyond the implementation limit, 'add'
command errors out and rev-cache won't kick in for such a history."

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion
  2009-08-08  7:31 [PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion Nick Edelen
  2009-08-08 18:31 ` Junio C Hamano
@ 2009-08-16 22:41 ` Nick Edelen
  1 sibling, 0 replies; 5+ messages in thread
From: Nick Edelen @ 2009-08-16 22:41 UTC (permalink / raw)
  To: Nick Edelen, Junio C Hamano, Nicolas Pitre, Johannes Schindelin,
	Sam Vilain

Before any code is introduced the full documentation is put forth.  This 
provides a man page for the porcelain, and a technical doc in technical/.  The 
latter describes the API, and discusses rev-cache's design, file format and 
mechanics.

Signed-off-by: Nick Edelen <sirnot@gmail.com>

---
 Documentation/git-rev-cache.txt       |  144 ++++++++
 Documentation/technical/rev-cache.txt |  614 +++++++++++++++++++++++++++++++++
 2 files changed, 758 insertions(+), 0 deletions(-)

diff --git a/Documentation/git-rev-cache.txt b/Documentation/git-rev-cache.txt
new file mode 100644
index 0000000..3479499
--- /dev/null
+++ b/Documentation/git-rev-cache.txt
@@ -0,0 +1,144 @@
+git-rev-cache(1)
+================
+
+NAME
+----
+git-rev-cache - Add, walk and maintain revision cache slices
+
+SYNOPSIS
+--------
+'git-rev-cache' COMMAND [options] [<commit>...]
+
+DESCRIPTION
+-----------
+The revision cache ('rev-cache') provides a mechanism for significantly
+speeding up revision traversals.  It does this by creating an efficient
+database (cache) of commits, their related objects and topological relations.
+Independant of packs and the object store, this database is composed of
+rev-cache "slices" -- each a different file storing a given segment of commit
+history.  To map commits to their respective slices, a single index file is
+kept for the rev-cache.
+
+'git-rev-cache' provides a front-end for the rev-cache mechanism, intended for
+updating and maintaining rev-cache slices in the current repository.  New cache
+slice files can be 'add'ed, to keep the cache up-to-date; individual slices can
+be traversed; smaller slices can be 'fuse'd into a larger slice; and the
+rev-cache index can be regenerated.
+
+COMMANDS
+--------
+
+add
+~~~
+Add revisions to the cache by creating a new cache slice.  Reads a revision
+list from the command line, formatted as: `START START ... \--not END END ...`
+
+Options:
+
+\--all::
+	Include all refs in the new cache slice, like the \--all option in
+	'rev-list'.
+
+\--fresh::
+	Exclude everything already in the revision cache, analogous to
+	\--incremental in 'pack-objects'.
+
+\--stdin::
+	Read newline-seperated revisions from the standard input.  Use \--not
+	to exclude commits, as on the command line.
+
+\--legs::
+	Ensure newly-generated cache slice has no partial ends.  This means that
+	no commit has partially cached parents, in that all its parents are
+	cached or none of them are.
++
+\--legs will cause 'rev-cache' to expand potential slice end-points (creating
+"legs") until this condition is met, simplifying the cache slice structure.
+'rev-cache' itself does not care if a slice has legs or not, but the condition
+may reduce the required complexity of other applications that might use the
+revision cache.
+
+\--no-objects::
+	Non-commit objects are normally included along with the commit with
+	which they were introduced.  This is obviously very benificial, but can
+	take longer in cache slice generation.  Using this option will disable
+	non-commit object caching.
++
+\--no-objects is mainly intended for debugging or development purposes, but may
+find use in special situations (e.g. common traversal of only commits).
+
+walk
+~~~~
+Analogous to a slice-oriented 'rev-list', 'walk' will traverse a region in a
+particular cache slice.  Interesting and uninteresting (delimited, as with
+'rev-list', with \--not) are specified on the command line, and output is the
+same as vanilla 'rev-list'.
+
+Options:
+
+\--objects::
+	Like 'rev-list', 'walk' will normally only list commits.  Use this
+	option to list non-commit objects as well, if they are present in the
+	cache slice.
+
+fuse
+~~~~
+Merge several cache slices into a single large slice, like 'repack' for
+'rev-cache'.  On each invocation of 'add' a new file ("slice") is added to the
+revision cache directory, and after several additions the directory may become
+populated with many, relatively small slices.  Numerous smaller slices will
+yield poorer performance than a one or two large ones, because of the overhead
+of loading new slices into memory.
+
+Running 'fuse' every once in a while will solve this problem by coalescing all
+the cache slices into one larger slice.  For very large projects, using
+\--ignore-size is advisable to prevent overly large cache slices.  Setting git
+'config' option 'gc.revcache' to 1 will enable cache slice fusion upon garbage
+collection.
+
+Note that 'fuse' uses the internal revision walker, so the options used in
+fusion override those of the cache slices upon which it operates.  For example,
+if some slices were generated with \--no-objects, yet 'fuse' was performed with
+non-commit objects, the resulting slice would still contain objects but would
+take longer to generate.
+
+Options:
+
+\--all::
+	Normally fuse will only include everything that's already in the
+	revision cache.  \--add tells it to start walking from the branch
+	heads, effectively an `add --all --fresh; fuse` (pseudo-command).
+
+\--no-objects::
+	As in 'add', this option disables inclusion of non-commit objects.  If
+	some cache slices do contain such objects, the information will be lost.
+
+\--ignore-size[=N]::
+	Do not merge cache slices of size >=N (be aware that slices must be
+	mapped to memory).  N can have a suffix of "k" or "m", denoting N as
+	kilobytes and megabytes, respectively.  If N is not provided 'fuse'
+	will default to a size of ~25MB.
+
+index
+~~~~~
+Regenerate the revision cache index.  If the rev-cache index file associating
+objects with cache slices gets corrupted, lost, or otherwise becomes unusable,
+'index' will quickly regenerate the file.  It's most likely that this won't be
+needed in every day use, as it is targeted towards debugging and development.
+
+alt
+~~~
+Create a cache slice pointer to another slice, identified by its full path:
+`fuse path/to/other/slice`
+
+This command is useful if you have several repositories sharing a common
+history.  Although space requirements for rev-cache are slim anyway, you can in
+this situation reduce it further by using slice pointers, pointing to relavant
+slices in other repositories.  Note that only one level of redirection is
+allowed, and the slice pointer will break if the original slice is removed.
+'fuse' will not touch slice pointers.
+
+DISCUSSION
+----------
+For an explanation of the API and its inner workings, see
+link:technical/rev-cache.txt[technical info on rev-cache].
diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt
new file mode 100644
index 0000000..6e7c7f6
--- /dev/null
+++ b/Documentation/technical/rev-cache.txt
@@ -0,0 +1,614 @@
+rev-cache
+=========
+
+The revision cache API ('rev-cache') provides a method for efficiently storing
+and accessing commit branch sections.  Such branch slices are defined by a
+series of start/top (interesting) and end/bottom (uninteresting) commits.  Each
+slice contains information on commits in topological order.  Recorded with each
+commit is:
+
+* All intra-slice topological relations, encoded into path "channels" (see
+  'Mechanics' for full explanation).
+* Object meta-data: type, SHA-1, size, date (for commits).
+* Objects introduced by that commit, not present in the its cached parents.
+
+In addition to the API, basic structures are exported for the possibility of
+direct access.
+
+The API
+-------
+You can find the function prototypes in `revision.h`.
+
+Data Structures
+~~~~~~~~~~~~~~~
+The `rev_cache_info` struct holds all the options and flags for the API.
+
+----
+struct rev_cache_info {
+	/* generation flags */
+	unsigned objects : 1,
+		legs : 1,
+		make_index : 1,
+		fuse_me : 1;
+
+	/* index inclusion */
+	unsigned overwrite_all : 1;
+
+	/* traversal flags */
+	unsigned add_to_pending : 1;
+
+	/* fuse options */
+	unsigned int ignore_size;
+
+	/* reserved */
+	struct rev_cache_slice_map *maps,
+		*last_map;
+};
+----
+
+The fields:
+
+`objects`::
+	Add non-commit objects to slice.
+
+`legs`::
+	Ensure end/bottom commits have no children.
+
+`make_index`::
+	Integrate newly-made slice into index.
+
+`fuse_me`::
+	This is specified if a fuse is occuring, and slices are to be reused.
+	This option requires `maps` and `last_maps` to be initialized.
+
+`overwrite_all`::
+	When a cache slice is added to the index, sometimes overlap occures
+	between it and other slices.  Normally, original index entries are kept
+	unless the new entry represents a start commit (older entries are more
+	likely to lead to greater in-slice traversals).  This options overrides
+	that, and updates all entries of the new slice.
+
+`add_to_pending`::
+	Append unique non-commit objects to the `pending` object list in the
+	passed `rev_info` instance.
+
+`add_names`::
+	Include non-commit object names in the pending object entries if
+	`add_to_pending` is set.
+
+`ignore_size`::
+	If non-zero, ignore slices with size greater or equal to this during
+fusion.
+
+`maps`/`last_map`::
+	An array of slice mappings, indexed by their id in the slice index
+	header, to be re-used with `fuse_me`.  `last_map` points to the last
+	mapping used, and should be initialized to 0.
+
+Functions
+~~~~~~~~~
+
+init_rev_cache
+^^^^^^^^^^^^^^
+----
+void init_rev_cache_info(
+	struct rev_cache_info *rci OUT
+)
+----
+
+Initialize `rci` to default options.
+
+make_cache_slice
+^^^^^^^^^^^^^^^^
+----
+int make_cache_slice(
+	struct rev_cache_info *rci IN,
+	struct rev_info *revs IN,
+	struct commit_list **starts IN/OUT,
+	struct commit_list **ends IN/OUT,
+	unsigned char *cache_sha1 OUT
+)
+----
+
+Create a cache slice based on either `revs` (if non-NULL) *or* the `starts` and
+`ends` lists.  The actual list of start and end commits of the slice may be
+different from the parameters, based on what defines the branch segment, and
+this actual list is passed back through `starts` and `ends`.
+
+The cache slice is identified via a SHA-1 generated from the actual start/end
+commit lists.  `cache_sha1`, if non-NULL, can recieve the cache slice name.
+`rci` is used to specify generation options, but can be NULL if you want
+`make_cache_slice` to fall back on defaults.  Returns 0 on success, non-zero on
+failure.
+
+make_cache_index
+^^^^^^^^^^^^^^^^
+----
+int make_cache_index(
+	struct rev_cache_info *rci IN,
+	unsigned char *cache_sha1 IN,
+	int fd IN,
+	unsigned int size IN
+)
+----
+
+Add a slice to the rev-cache index.  `cache_sha1` is the identity hash of the
+cache slice; `fd` is a file descriptor of the cache slice opened with
+read/write privileges (the slice is not actually modified); `size` is the size
+of the cache slice.  Although there are currently no options for index
+updating, `rci` is a placeholder in case of future options.  Note that this
+function is normally called by `make_cache_slice`.  Returns 0 on success,
+non-zero on failure.
+
+open_cache_slice
+^^^^^^^^^^^^^^^^
+----
+int open_cache_slice(
+	unsigned char *sha1 IN,
+	int flags IN
+)
+----
+
+Returns a file descriptor to a cache slice described by `sha1` hash, using
+`flags` as the access mode.  This will follow cache slice pointers to one level
+of indirection.
+
+get_cache_slice
+^^^^^^^^^^^^^^^
+----
+unsigned char *get_cache_slice(
+	struct commit *commit IN
+)
+----
+
+Given a commit object `get_cache_slice` will search the revision cache index
+and return, if found, the cache slice SHA-1.
+
+traverse_cache_slice
+^^^^^^^^^^^^^^^^^^^^
+----
+int traverse_cache_slice(
+	struct rev_info *revs IN/OUT,
+	unsigned char *cache_sha1 IN,
+	struct commit *commit IN,
+	unsigned long *date_so_far IN/OUT,
+	int *slop_so_far IN/OUT,
+	struct commit_list ***queue OUT,
+	struct commit_list **work IN/OUT
+)
+----
+
+Traverse a specified cache slice.  An explanation of the each field:
+
+`revs`::
+	The revision walk instance.  `traverse_cache_slice` uses this for
+	general options (e.g. which objects are included) and slice traversal
+	options (in the `rev_cache_info` field).  If the `add_to_pending`
+	option is specified, non-commit objects are appended to the `pending`
+	object list field.
+
+`cache_sha1`::
+	SHA-1 identifying the cache slice to use.  This can be taken directly
+	from `get_cache_slice`.
+
+`commit`::
+	The current commit object in the revision walk, i.e. the commit which
+	inspired this slice traversal.  Although theoretically redundant in
+	view of the `work` list, this simplifies interaction with normal
+	revision walks, which pop commits from `work` before analyzing them.
+
+`date_so_far`::
+	The date of the oldest encountered interesting commit.  Passing NULL
+	will let `traverse_cache_slice` use defaults.
+
+`slop_so_far`::
+	The `slop` value, a la revision.c.  This is a counter used to determine
+	when to stop traversing, based on how many extra uninteresting commits
+	should be encountered.  NULL will enable defaults, as above.
+
+`queue`::
+	Refers to a pointer to the head of a FIFO commit list, recieving the
+	commits we've seen and added.
+
+`work`::
+	A date-ordered list of commits that have yet to be processed (i.e. seen
+	but not added).  Commits from here present in the slice are removed
+	(and, obviously, used as starting places for traversal), and any end
+	commits encountered are inserted.
+
+starts_from_slices
+^^^^^^^^^^^^^^^^^^
+----
+void starts_from_slices(
+	struct rev_info *revs OUT,
+	unsigned int flags IN,
+	unsigned char *which IN,
+	int n IN
+)
+----
+
+Will mark start-commits in certain rev-cache slices with `flag`, and added them
+to the pending list of `revs`.  If `n` is zero, `starts_from_slices` will use
+all slices.  Otherwise `which` will specify an *unseperated* list of cache
+SHA-1s to use (20 bytes each), and `n` will contain the number of slices (i.e.
+20 * `n` = size of `which`).
+
+fuse_cache_slices
+^^^^^^^^^^^^^^^^^
+----
+int fuse_cache_slices(
+	struct rev_cache_info *rci IN,
+	struct rev_info *revs IN
+)
+----
+
+Generate a slice based on `revs`, replacing all encountered slices with one
+(larger) slice.  The `ignore_size` field in `rci`, if non-zero, will dictate
+which cache slice sizes to ignore in both traversal and replacement.
+
+regenerate_cache_index
+^^^^^^^^^^^^^^^^^^^^^^
+----
+int regenerate_cache_index(
+	struct rev_cache_info *rci IN
+)
+----
+
+Remake the revision cache index, including all the slices.  Currently no
+options in `rci` exist for index (re)generation, but some may develop in the
+future.
+
+to/from_disked_rc_object/index_entry
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+----
+struct rc_object/index_entry *from_disked_rc_object/index_entry(
+	struct rc_object/index_entry_ondisk *src IN,
+	struct rc_object/index_entry *dst OUT
+)
+
+struct rc_object/index_entry_ondisk *to_disked_rc_object/index_entry(
+	struct rc_object/index_entry *src IN,
+	struct rc_object/index_entry_ondisk *dst OUT
+)
+----
+
+Functions to convert between the internal and storage (`_ondisk`) versions of
+object and index entry structures.  These are necessary for direct access to
+the cache slices.  If NULL is provided for `dst` a statically allocated
+structure is used, and a pointer to the struct is returned.  Otherwise the
+functions return `dst`.
+
+Example Usage
+-------------
+
+A few examples to demonstrate usage:
+
+.Creating a slice
+----
+/* pretend you're a porcelain for rev-cache reading from the command line */
+struct rev_info revs;
+struct rev_cache_info rci;
+
+init_revisions(&revs, 0);
+init_rci(&rci);
+
+flags = 0;
+for (i = 1; i < argc; i++) {
+        if (!strcmp(argv[i], "--not"))
+                flags ^= UNINTERESTING;
+        else if(!strcmp(argv[i], "--fresh"))
+                starts_from_slices(&revs, UNINTERESTING, 0, 0);
+        else
+                handle_revision_arg(argv[i], &revs, flags, 1);
+}
+
+/* we want to explicitly set certain options */
+rci.objects = 0;
+
+if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1))
+        printf("made slice!  it's called %s\n", sha1_to_hex(cache_sha1));
+----
+
+.Traversing a slice
+----
+/* let's say you're walking the tree with a 'work' list of current heads and a
+ * FILO output list 'out' */
+out = 0;
+outp = &out;
+
+while (work) {
+        struct commit *commit = pop_commit(&work);
+        struct object *object = &commit->object;
+        unsigned char *cache_sha1;
+
+        if (cache_sha1 = get_cache_slice(object->sha1)) {
+                /* note that this will instatiate any topo-relations
+                 * as it goes */
+                if (traverse_cache_slice(&revs, cache_sha1,
+                        commit, 0, 0, /* use defaults */
+                        &outp, &work) < 0)
+                        die("I'm overreacting to a non-fatal cache error");
+        } else {
+                struct commit_list *parents = commit->parents;
+
+                while (parents) {
+                        struct commit *p = parents->item;
+                        struct object *po = &p->object;
+
+                        parents = parents->next;
+                        if (po->flags & UNINTERESTING)
+                                continue;
+
+                        if (object->flags & UNINTERESTING)
+                                po->flags |= UNINTERESTING;
+                        else if (po->flags & SEEN)
+                                continue;
+
+                        if (!po->parsed)
+                                parse_commit(p);
+                        insert_by_date(p, &work);
+                }
+
+                if (object->flags & (SEEN | UNINTERESTING) == 0)
+                        outp = &commit_list_insert(commit, outp)->next;
+                object->flags |= SEEN;
+        }
+}
+----
+
+Some Internals
+--------------
+For more advanced usage, the slice and index file(s) may be accessed directly.
+Relavant structures are availabe in `rev-cache.h`.
+
+File Formats
+~~~~~~~~~~~~
+
+Cache Slices
+^^^^^^^^^^^^
+A slice has a basic fixed-size header, followed by a certain number of object
+entries, then a NULL-seperated list of object names.  Commits are sorted in
+topo-order, and each commit entry is followed by the objects added in that
+commit.
+
+----
+         -- +--------------------------------+
+header      | object number, etc...          |
+         -- +--------------------------------+
+commit      | commit info                    |
+entry       | path data                      |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | ...                            |
+         -- +--------------------------------+
+commit      | commit info                    |
+entry       | path data                      |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | ...                            |
+         -- +--------------------------------+
+...         ...
+         -- +--------------------------------+
+name list   | \0some_file_name\0             |
+(note       +--------------------------------+
+preceeding  | another_file\0                 |
+null)       ...                              |
+            +--------------------------------+
+----
+
+Here is the header:
+
+----
+struct rc_cache_slice_header {
+	char signature[8]; /* REVCACHE */
+	unsigned char version;
+	uint32_t ofs_objects;
+
+	uint32_t object_nr;
+	uint16_t path_nr;
+	uint32_t size;
+
+	unsigned char sha1[20];
+
+	uint32_t names_size;
+};
+----
+
+Explanations:
+
+`signature`::
+	The identifying signature of cache slice file.  Always "REVCACHE".
+`version`::
+	The version number, currently 1.
+`ofs_objects`::
+	The byte offset at which the commit/object listing starts.  Always
+	present at the 10th byte, regardless of file version.
+`object_nr`::
+	The total number of objects (commit + non-commit objects) present in
+	the slice.
+`path_nr`::
+	The total number of paths/channels used in encoding the topological
+	data.  Note that paths are reused (see 'Mechanics'), so there will
+	never be more than a few hundred paths (if that) used.
+`size`::
+	The size of the slice *excluding* the name list.  In other words, the
+	size of the portion mapped to memory.
+`sha1`::
+	The cache slice SHA-1.
+`names_size`::
+	The size of the name list.  `size` + `names_size` = size of slice
+
+Revision Cache Index
+^^^^^^^^^^^^^^^^^^^^
+The index is a single file that associates SHA-1s with cache slices and file
+positions.  It is somewhat similar to pack-file indexes, containing a fanout
+table and a list of index entries sorted by hash.
+
+----
+         -- +--------------------------------+
+header      | object #, cache #, etc.        |
+         -- +--------------------------------+
+sha1s of    | SHA-1                          |
+slices      | ...                            |
+         -- +--------------------------------+
+fanout      | fanout[0x00]                   |
+table       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+            | fanout[0xff]                   |
+         -- +--------------------------------+
+index       | SHA-1 of object                |
+entries     | index of cache slice SHA-1     |
+            | position in cache slice        |
+            +--------------------------------+
+            |                                |
+            ...
+            +--------------------------------+
+----
+
+The header:
+
+----
+struct rc_index_header {
+	char signature[8]; /* REVINDEX */
+	unsigned char version;
+	uint32_t ofs_objects;
+
+	uint32_t object_nr;
+	unsigned char cache_nr;
+
+	uint32_t max_date;
+};
+----
+
+Explanations:
+
+`signature`::
+	Always "REVINDEX".
+`version`::
+	Version number, currently 1.
+`ofs_objects`::
+	Offset at which the entry objects begin.  This is more obviously useful
+	in the index because the list of slice SHA-1s is variably-sized.
+`object_nr`::
+	Number of index entry objects present.
+`cache_nr`::
+	Number of cache slices to which the index maps, and hence the number of
+slice SHA-1s listed.
+`max_date`::
+	The oldest commit represented in the index.  This is used to help speed
+up lookup times by knowing what range of commits we definitely don't have
+cached.  Normal usage of 'rev-cache' would leave no "holes" in its coverage of
+commit history -- once a commit is cached, everything reachable from it should
+be cached as well.  Most of the time refs are added to rev-cache simultaneous
+as well.  This means that in most situations almost everything <= `max_date`
+will be cached.
+
+Mechanics
+~~~~~~~~~
+
+The most important part of rev-cache is its method of encoding topological
+relations.  To ensure fluid traversal and reconstruction, commits are related
+through high-level "streams"/"channels" rather than individual
+interconnections.  Intuitively, rev-cache stores history the way gitk shows it:
+commits strung up on lines, which interconnect at merges and branches.
+
+Each commit is associated to a given channel/path via a 'path id', and
+variable-length fields govern which paths (if any) are closed or opened at that
+object.  This means that topo-data can be preserved in only a few bytes extra
+per object entry.  Other information stored per entry is the sha-1 hash, type,
+date, size, name, and status in cache slice.  Here is format of an object
+entry, both on-disk and in-memory:
+
+----
+struct object_entry {
+        unsigned type : 3;
+        unsigned is_end : 1;
+        unsigned is_start : 1;
+        unsigned uninteresting : 1;
+        unsigned include : 1;
+        unsigned flags : 1;
+        unsigned char sha1[20];
+
+        unsigned char merge_nr;
+        unsigned char split_nr;
+        unsigned size_size : 3;
+        unsigned name_size : 3;
+
+        uint32_t date;
+        uint16_t path;
+
+        /* merge paths */
+        /* split paths */
+        /* size */
+        /* name index */
+};
+----
+
+An explanation of each field:
+
+`type`::
+	Object type
+`is_end`::
+	The commit has some parents outside the cache slice (all if slice has
+	legs)
+`is_start`::
+	The commit has no children in cache slice
+`uninteresting`::
+	Run-time flag, used in traversal
+`include`::
+	Run-time flag, used in traversal (initialization)
+`flags`::
+	Currently unused, extra bit
+`sha1`::
+	Object SHA-1 hash
+
+`merge_nr`::
+	The number of paths the current channel diverges into; the current path
+	ends upon any merge.
+`split_nr`::
+	The number of paths this commit ends; used on both merging and
+	branching.
+`size_size`::
+	Number of bytes the object size takes up.
+`name_size`::
+	Number of bytes the name index takes up.
+
+`date`::
+	The date of the commit.
+`path`::
+	The path ID of the channel with which this commit is associated.
+
+merge paths::
+	The path IDs (16-bit) that are to be created.  Overflow is not a
+	problem as path IDs are reused, leaving even complicated projects to
+	consume no more than a few hundred IDs.
+split paths::
+	The path IDs (16-bit) that are to be ended.
+size::
+	The size split into the minimum number of bytes.  That is, 1-8 bytes
+	representing the size, least-significant byte first.
+name index::
+	An offset for the null-seperated, object name list at the end of the
+	cache slice.  Also split into the minimum number of bytes.
+
+Each path ID refers to an index in a 'path array', which stores the current
+status (eg. active, interestingness) of each channel.
+
+Due to topo-relations and boundary tracking, all of a commit's parents must be
+encountered before the path is reallocated.  This is achieved by using a
+counter system per merge: starting at the parent number, the counter is
+decremented as each parent is encountered (dictated by 'split paths'); at 0 the
+path is cleared.
+
+Boundary tracking is necessary because non-commits are stored relative to the
+commit in which they were introduced.  If a series of commits is not included
+in the output, the last interesting commit must be parsed manually to ensure
+all objects are accounted for.
+
+To prevent list-objects from recursing into trees that we've already taken care
+of, the flag `FACE_VALUE` is introduced.  An object with this flag is not
+explored (= "taken at face value"), significantly reducing I/O and processing
+time.
-- 
tg: (797cc19..) t/revcache/docs (depends on: t/revcache/integration)

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2009-08-16 22:42 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-08  7:31 [PATCH 1/6 (v2)] revision caching documentation: man page and technical discussion Nick Edelen
2009-08-08 18:31 ` Junio C Hamano
2009-08-09 14:01   ` Nick Edelen
2009-08-09 18:16     ` Junio C Hamano
2009-08-16 22:41 ` Nick Edelen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).