[PATCH 1/5] revision caching documentation: man page and technical discussion

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/5] revision caching documentation: man page and technical discussion
@ 2009-08-06  9:55 Nick Edelen
  2009-08-07  3:08 ` Sam Vilain
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Edelen @ 2009-08-06  9:55 UTC (permalink / raw)
  To: Junio C Hamano, Johannes Schindelin, Jeff King, Sam Vilain,
	Shawn O. Pearce

Before any code is introduced the full documentation is put forth.  This 
provides a man page for the porcelain, and a technical doc in technical/.  The 
latter describes the API, and discusses rev-cache's design, file format and 
mechanics.

Signed-off-by: Nick Edelen <sirnot@gmail.com>

---
 Documentation/rev-cache.txt           |   51 +++++
 Documentation/technical/rev-cache.txt |  336 +++++++++++++++++++++++++++++++++
 2 files changed, 387 insertions(+), 0 deletions(-)

diff --git a/Documentation/rev-cache.txt b/Documentation/rev-cache.txt
new file mode 100755
index 0000000..64bd051
--- /dev/null
+++ b/Documentation/rev-cache.txt
@@ -0,0 +1,51 @@
+rev-cache porcelain
+===================
+
+A front end for the rev-cache API is provided with the builtin utility 
+`rev-cache`.  It is mainly intended for cache slice generation and maintenance, 
+but can also walk commits within a slice.  At the moment it is not particularly 
+advanced, but is sufficient for repository administration.
+
+It's general syntax is:
+
+`git-rev-cache COMMAND [options] [<commit-id>...]`
+
+With the commands:
+
+`add`::
+	Add revisions to the cache.  Reads commit ids from stdin, formatted as:
+	`END END ... \--not START START ...`
++
+Options:
+
+`\--all`:: Include all heads as ends.
+`\--fresh`:: Exclude everything already in a cache slice.
+`\--stdin`:: Also read commit ids from stdin (seperated by newline, `\--not` 
+also valid).
+`\--legs`:: Ensure branch has no "dangling" starts (ie. is self-contained).
+`\--noobjects`:: Don't include non-commit objects.
+
+`walk`::
+	Walk a cache slice based on set of commits; formatted as add.
++
+Options:
+
+`\--objects`:: Include non-commit objects in traversal.
+
+`fuse`::
+	Coagulate several cache slices into a single large slice.
++
+Options:
+
+`\--all`:: Include all objects in repository.  Will only traverse as of cache 
+ends if this is not specified.
+`\--noobjects`:: Don't include non-commit objects.
+`\--ignore-size[=N]`:: Do not fuse slices of file size >= `N`.  If `N` is not 
+given the cutoff size defaults to ~5MB.
+
+`index`::
+	Regenerate the cache index.
+
+
+For an explanation of the API and its inner workings, see 
+link:technical/rev-cache.txt[technical info on rev-cache]
diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt
new file mode 100755
index 0000000..e95ec89
--- /dev/null
+++ b/Documentation/technical/rev-cache.txt
@@ -0,0 +1,336 @@
+rev-cache
+=========
+
+The revision cache API ('rev-cache') provides a method for efficiently storing 
+and accessing commit branch sections.  Such branch slices are defined by a 
+series of top (interesting) and bottom (uninteresting) commits.  Each slice 
+contains, per commit:
+
+* All intra-slice topological relations, encoded into path "channels" (see 
+  'Mechanics' for full explanation).
+* Object meta-data: type, SHA-1, size, date (for commits).
+* Objects introduced in that commit, relative to slice (ie. only for non-start 
+  commits).
+
+Storage data structures are not exported, in part to keep git's global scope 
+clean, but largely because they're pretty much useless outside of rev-cache.
+
+The API
+-------
+
+The API for 'rev-cache' is quite simple.  You can find the function prototypes 
+in `revision.h`.
+
+Data Structures
+~~~~~~~~~~~~~~~
+
+The `rev_cache_info` struct holds all the options and flags for the API.
+
+----
+struct rev_cache_info {
+        /* generation flags */
+        unsigned objects : 1, 
+                legs : 1, 
+                make_index : 1;
+        
+        /* traversal flags */
+        unsigned save_unique : 1, 
+                add_to_pending : 1;
+        
+        /* fuse options */
+        unsigned int ignore_size;
+};
+----
+
+The fields:
+
+`objects`:: Add non-commit objects to slice.
+`legs`:: Ensure bottoms have no childrens.
+`make_index`:: Integrate newly-made slice into index.
+`save_unique`:: Load unique non-commit objects into `unique` field of each 
+`commit` object.
+`add_to_pending`:: Append unique non-commit objects to the `pending` object 
+list in the passed `rev_info` instance.
+`ignore_size`:: If non-zero, ignore slices with size greater or equal to this.
+
+Functions
+~~~~~~~~~
+
+`init_rci`::
+
+        Initiate a `rev_cache_info` struct to default options.  
+
+`make_cache_slice`::
+
+        Create a cache based on an a `rev_info` instance or `commit_list` s of 
+        "tops" and "bottoms" (defaulting to latter if `rev_info` pointer is 
+        NULL), copying the cache SHA-1 into a passed pointer if non-zero.  A 
+        `rev_cache_info` struct pointer can be passed to set options, while 
+        passing NULL will set default options.  A last parameter can 
+        optionally recieve the final cache hash.
+
+`make_cache_index`::
+
+        Add a slice to the cache index.  Requires a file descriptor, the cache 
+        hash and the file size.  Note that this is normally called by 
+        `make_cache_slice` under the `make_index` option.
+
+`get_cache_slice`::
+
+        Given a commit SHA-1 `get_cache_slice` will search the slice index and 
+        return, if found, the cache-identifying SHA-1.
+
+`traverse_cache_slice`::
+
+        Traverse a specified cache slice based on:
+
+        * `rev_cache_info` instance (optional)
+        * cache SHA-1 identifier
+        * `rev_info` instance
+        * a starting commit and commit work list
+        * date of oldest encountered interesting commit
+        * current `slop` (this and above mainly used in integration with 
+          revision walker)
+        
++
+The output is sent to a FILO `commit_list` "queue", while any bottom commits 
+are passed back into the work list.  If the walk is not contained within the 
+slice, commit boundaries are also inserted into "work".
+
+`tops_from_slices`::
+
+        Will mark all top-commits in the specified cache slices with a given 
+        flag, and add them to the rev pending list.  Will include all if no 
+        slices are specified.
+
+`coagulate_cache_slices`::
+
+        Generate a slice based on the passed `rev_info` instance, replacing all 
+        encountered slices with one (larger) slice.  The `ignore_size` field in 
+        `rci`, if non-zero, will dictate which cache slice sizes to ignore in 
+        both traversal and replacement.
+
+`regenerate_index`::
+
+        Remake cache index.
+
+Example Usage
+-------------
+
+A few examples to demonstrate usage:
+
+.Creating a slice
+----
+/* pretend you're a porcelain for rev-cache reading from the command line */
+struct rev_info revs;
+struct rev_cache_info rci;
+
+init_revisions(&revs, 0);
+init_rci(&rci);
+
+flags = 0;
+for (i = 1; i < argc; i++) {
+        if (!strcmp(argv[i], "--not"))
+                flags ^= UNINTERESTING;
+        else if(!strcmp(argv[i], "--fresh"))
+                starts_from_slices(&revs, UNINTERESTING, 0, 0);
+        else
+                handle_revision_arg(argv[i], &revs, flags, 1);
+}
+
+/* we want to explicitly set certain options */
+rci.objects = 0;
+
+if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1))
+        printf("made slice!  it's called %s\n", sha1_to_hex(cache_sha1));
+----
+
+.Traversing a slice
+----
+/* let's say you're walking the tree with a 'work' list of current heads and a 
+ * FILO output list 'out' */
+out = 0;
+outp = &out;
+
+while (work) {
+        struct commit *commit = pop_commit(&work);
+        struct object *object = &commit->object;
+        unsigned char *cache_sha1;
+        
+        if (cache_sha1 = get_cache_slice(object->sha1)) {
+                /* note that this will instatiate any topo-relations 
+                 * as it goes */
+                if (traverse_cache_slice(&revs, cache_sha1, 
+                        commit, 0, 0, /* use defaults */
+                        &outp, &work) < 0)
+                        die("I'm overreacting to a non-fatal cache error");
+        } else {
+                struct commit_list *parents = commit->parents;
+                
+                while (parents) {
+                        struct commit *p = parents->item;
+                        struct object *po = &p->object;
+                        
+                        parents = parents->next;
+                        if (po->flags & UNINTERESTING)
+                                continue;
+                        
+                        if (object->flags & UNINTERESTING)
+                                po->flags |= UNINTERESTING;
+                        else if (po->flags & SEEN)
+                                continue;
+                        
+                        if (!po->parsed)
+                                parse_commit(p);
+                        insert_by_date(p, &work);
+                }
+                
+                if (object->flags & (SEEN | UNINTERESTING) == 0)
+                        outp = &commit_list_insert(commit, outp);
+                object->flags |= SEEN;
+        }
+}
+----
+
+Some Internals
+--------------
+
+Although you really don't need to know anything about how rev-cache actually 
+does its magic shizz, a bit of background may go a long way if you're wading 
+through the source.
+
+File Formats
+~~~~~~~~~~~~
+
+A slice has a basic fixed-size header, followed by a certain number of object 
+entries.  Commits are sorted in topo-order, and each commit entry is followed 
+by the objects added in that commit.
+
+----
+         -- +--------------------------------+
+header      | object number, etc...          |
+         -- +--------------------------------+
+commit      | commit info                    |
+entry       | path data                      |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | ...                            |
+         -- +--------------------------------+
+commit      | commit info                    |
+entry       | path data                      |
+            +--------------------------------+
+            | tree/blob info                 |
+            +--------------------------------+
+            | ...                            |
+         -- +--------------------------------+
+...         ...                               
+            +--------------------------------+
+----
+
+The index is somewhat similar to pack-file indexes, containing a fanout table 
+and a list of index entries sorted by hash.
+
+----
+         -- +--------------------------------+
+header      | object #, cache #, etc.        |
+         -- +--------------------------------+
+cache       | SHA-1                          |
+sha1s       | ...                            |
+         -- +--------------------------------+
+fanout      | fanout[0x00]                   |
+table       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+            | fanout[0xff]                   |
+         -- +--------------------------------+
+index       | entry SHA-1                    |
+entries     | cache sha1 index               |
+            +--------------------------------+
+            |                                |
+            ...                               
+            +--------------------------------+
+----
+
+All the relavant structures are readily accessible in `rev-cache.c`
+
+Mechanics
+~~~~~~~~~
+
+The most important part of rev-cache is its method of encoding topological 
+relations.  To ensure fluid traversal and reconstruction, commits are related 
+through high-level "streams"/"channels" rather than individual 
+interconnections.  Intuitively, rev-cache stores history the way gitk shows it: 
+commits strung up on lines, which interconnect at merges and branches.
+
+Each commit is associated to a given channel/path via a 'path id', and 
+variable-length fields govern which paths (if any) are closed or opened at that 
+object.  This means that topo-data can be preserved in only a few bytes extra 
+per object entry.  Other information stored per entry is the sha-1 hash, type, 
+date, size, and status in cache slice.  Here is format of an object entry, both 
+on-disk and in-memory:
+
+----
+struct object_entry {
+        unsigned type : 3;
+        unsigned is_end : 1;
+        unsigned is_start : 1;
+        unsigned uninteresting : 1;
+        unsigned include : 1;
+        unsigned flags : 1;
+        unsigned char sha1[20];
+        
+        unsigned merge_nr : 6;
+        unsigned split_nr : 7;
+        unsigned size_size : 3;
+        
+	uint32_t date;
+	uint16_t path;
+        
+        /* merge paths */
+        /* split paths */
+        /* size */
+};
+----
+
+An explanation of each field:
+
+`type`:: Object type
+`is_end`:: The commit has some parents outside the cache slice (all if slice 
+has legs)
+`is_start`:: The commit has no children in cache slice
+`uninteresting`:: Run-time flag, used in traversal
+`include`:: Run-time flag, used in traversal (initialization)
+`flags`:: Currently unused, extra bit
+`sha`:: Object SHA-1 hash
+
+`merge_nr`:: The number of paths the current channel diverges into; the current 
+path ends upon any merge.
+`split_nr`:: The number of paths this commit ends; used on both merging and 
+branching.
+`size_size`:: Number of bytes the object size takes up.
+
+merge paths:: The path IDs (16-bit) that are to be created.
+split paths:: The path IDs (16-bit) that are to be ended.
+size:: The size split into the minimum number of bytes.
+
+Each path ID refers to an index in a 'path array', which stores the current 
+status (eg. active, interestingness) of each channel.
+
+Due to topo-relations and boundary tracking, all of a commit's parents must be 
+encountered before the path is reallocated.  This is achieved by using a 
+counter system per merge: starting at the parent number, the counter is 
+decremented as each parent is encountered (dictated by 'split paths'); at 0 the 
+path is cleared.
+
+Boundary tracking is necessary because non-commits are stored relative to the 
+commit in which they were introduced.  If a series of commits is not included 
+in the output, the last interesting commit must be parsed manually to ensure 
+all objects are accounted for.
+
+To prevent list-objects from recursing into trees that we've already taken care 
+of, the flag `FACE_VALUE` is introduced.  An object with this flag is not 
+explored (= "taken at face value"), significantly reducing I/O and processing 
+time.
+
+(NSE)
-- 

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/5] revision caching documentation: man page and technical discussion
  2009-08-06  9:55 [PATCH 1/5] revision caching documentation: man page and technical discussion Nick Edelen
@ 2009-08-07  3:08 ` Sam Vilain
  2009-08-07 12:07   ` Rogan Dawes
  0 siblings, 1 reply; 5+ messages in thread
From: Sam Vilain @ 2009-08-07  3:08 UTC (permalink / raw)
  To: Nick Edelen
  Cc: Junio C Hamano, Johannes Schindelin, Jeff King, Shawn O. Pearce,
	Andreas Ericsson, Christian Couder, git@vger.kernel.org

Nick Edelen wrote:
> Before any code is introduced the full documentation is put forth.  This 
> provides a man page for the porcelain, and a technical doc in technical/.  The 
> latter describes the API, and discusses rev-cache's design, file format and 
> mechanics.
>
> Signed-off-by: Nick Edelen <sirnot@gmail.com>
>
> ---
>  Documentation/rev-cache.txt           |   51 +++++
>  Documentation/technical/rev-cache.txt |  336 +++++++++++++++++++++++++++++++++
>  2 files changed, 387 insertions(+), 0 deletions(-)
>   

A couple of minor nits here:

1. Documentation/rev-cache.txt should be
Documentation/git-rev-cache.txt, because it documents that command.

2. there are many whitespace errors:

wilber:~/src/git$ git am \[PATCH\ 1_5\]\ revision\ caching\
documentation\:\ man\ page\ and\ technical\ discussion.eml
Applying: revision caching documentation: man page and technical discussion
/home/samv/src/git/.git/rebase-apply/patch:15: trailing whitespace.
A front end for the rev-cache API is provided with the builtin utility
/home/samv/src/git/.git/rebase-apply/patch:16: trailing whitespace.
`rev-cache`.  It is mainly intended for cache slice generation and
maintenance,
/home/samv/src/git/.git/rebase-apply/patch:17: trailing whitespace.
but can also walk commits within a slice.  At the moment it is not
particularly
/home/samv/src/git/.git/rebase-apply/patch:34: trailing whitespace.
`\--stdin`:: Also read commit ids from stdin (seperated by newline,
`\--not`
/home/samv/src/git/.git/rebase-apply/patch:51: trailing whitespace.
`\--all`:: Include all objects in repository.  Will only traverse as of
cache
warning: squelched 166 whitespace errors
warning: 171 lines add whitespace errors.

Be sure to check the patch doesn't do that...

> diff --git a/Documentation/rev-cache.txt b/Documentation/rev-cache.txt
> new file mode 100755
> index 0000000..64bd051
> --- /dev/null
> +++ b/Documentation/rev-cache.txt
> @@ -0,0 +1,51 @@
> +rev-cache porcelain
> +===================
> +
> +A front end for the rev-cache API is provided with the builtin utility 
> +`rev-cache`.  It is mainly intended for cache slice generation and maintenance, 
> +but can also walk commits within a slice.  At the moment it is not particularly 
> +advanced, but is sufficient for repository administration.
>   

That last sentence is a bit unnecessarily self-doubting; "It currently
provides basic functionality" would be fine.

> +
> +It's general syntax is:
> +
> +`git-rev-cache COMMAND [options] [<commit-id>...]`
> +
> +With the commands:
>   

You should use the same structure and headings of this file as the other
man pages which are found under Documentation/git-*.txt

> +
> +`add`::
> +	Add revisions to the cache.  Reads commit ids from stdin, formatted as:
> +	`END END ... \--not START START ...`
>   

Really, it's reading a revision list; might be better to call it that. 
Can it accept the same options as git-rev-list here or just a list of
starting points/tips?  And "END" and "START" are still backwards to the
rest of git core!

> ++
> +Options:
> +
> +`\--all`:: Include all heads as ends.
> +`\--fresh`:: Exclude everything already in a cache slice.
>   

If you use the paragraph layout used by other commands you can explain a
little bit more about what you mean here.  By "Cache Slice" do you mean
"Revision Cache"?  Does that --fresh imply that all of the revisions
which were not indexed will automatically get indexed?

> +`\--stdin`:: Also read commit ids from stdin (seperated by newline, `\--not` 
> +also valid).
> +`\--legs`:: Ensure branch has no "dangling" starts (ie. is self-contained).
>   

The --legs term is a little too 'cute', is there a better way to
describe this?  And "starts" is not a real word in that context; if you
are using a technical term, you should define it..

> +`\--noobjects`:: Don't include non-commit objects.
> +
> +`walk`::
> +	Walk a cache slice based on set of commits; formatted as add.
>   

So, this works like 'rev-list' ?

"Formatted" is wrong there I think; do you mean it 'accepts the same
arguments as add' ?

> ++
> +Options:
> +
> +`\--objects`:: Include non-commit objects in traversal.
> +
>   

Which was the default?  --objects or --no-objects?

> +`fuse`::
> +	Coagulate several cache slices into a single large slice.
>   

Coagulate?  You mean, the revision caches will stop being liquid and go
gluggy, like a pool of blood clotting?

How about "combine" :-) - and the option might be better called
something simple like that, too.

> ++
> +Options:
> +
> +`\--all`:: Include all objects in repository.  Will only traverse as of cache 
> +ends if this is not specified.
>   

That sentence doesn't quite read well to me.

> +`\--noobjects`:: Don't include non-commit objects.
> +`\--ignore-size[=N]`:: Do not fuse slices of file size >= `N`.  If `N` is not 
> +given the cutoff size defaults to ~5MB.
> +
> +`index`::
> +	Regenerate the cache index.
> +
> +
> +For an explanation of the API and its inner workings, see 
> +link:technical/rev-cache.txt[technical info on rev-cache]
> diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt
> new file mode 100755
> index 0000000..e95ec89
> --- /dev/null
> +++ b/Documentation/technical/rev-cache.txt
> @@ -0,0 +1,336 @@
> +rev-cache
> +=========
> +
> +The revision cache API ('rev-cache') provides a method for efficiently storing 
> +and accessing commit branch sections.  Such branch slices are defined by a 
> +series of top (interesting) and bottom (uninteresting) commits.  Each slice 
> +contains, per commit:
> +
> +* All intra-slice topological relations, encoded into path "channels" (see 
> +  'Mechanics' for full explanation).
> +* Object meta-data: type, SHA-1, size, date (for commits).
> +* Objects introduced in that commit, relative to slice (ie. only for non-start 
> +  commits).
> +
> +Storage data structures are not exported, in part to keep git's global scope 
> +clean, but largely because they're pretty much useless outside of rev-cache.
> +
> +The API
> +-------
> +
> +The API for 'rev-cache' is quite simple.  You can find the function prototypes 
> +in `revision.h`.
> +
> +Data Structures
> +~~~~~~~~~~~~~~~
> +
> +The `rev_cache_info` struct holds all the options and flags for the API.
> +
> +----
> +struct rev_cache_info {
> +        /* generation flags */
> +        unsigned objects : 1, 
> +                legs : 1, 
> +                make_index : 1;
> +        
> +        /* traversal flags */
> +        unsigned save_unique : 1, 
> +                add_to_pending : 1;
> +        
> +        /* fuse options */
> +        unsigned int ignore_size;
> +};
> +----
> +
> +The fields:
> +
> +`objects`:: Add non-commit objects to slice.
> +`legs`:: Ensure bottoms have no childrens.
>   

"children" is already plural and needs no 's'

> +`make_index`:: Integrate newly-made slice into index.
> +`save_unique`:: Load unique non-commit objects into `unique` field of each 
> +`commit` object.
>   

Unique how?  Do you mean, that they were introduced in this slice and
not reachable from any of the bottom/end commits?

> +`add_to_pending`:: Append unique non-commit objects to the `pending` object 
> +list in the passed `rev_info` instance.
>   

That doesn't make much sense to me either.  What does that mean?  Well,
I'll read on.

> +`ignore_size`:: If non-zero, ignore slices with size greater or equal to this.
>   

What will this ignoring mean?

> +Functions
> +~~~~~~~~~
> +
> +`init_rci`::
> +
> +        Initiate a `rev_cache_info` struct to default options.  
> +
> +`make_cache_slice`::
> +
> +        Create a cache based on an a `rev_info` instance or `commit_list` s of 
> +        "tops" and "bottoms" (defaulting to latter if `rev_info` pointer is 
> +        NULL), copying the cache SHA-1 into a passed pointer if non-zero.  A 
> +        `rev_cache_info` struct pointer can be passed to set options, while 
> +        passing NULL will set default options.  A last parameter can 
> +        optionally recieve the final cache hash.
> +
> +`make_cache_index`::
> +
> +        Add a slice to the cache index.  Requires a file descriptor, the cache 
> +        hash and the file size.  Note that this is normally called by 
> +        `make_cache_slice` under the `make_index` option.
> +
> +`get_cache_slice`::
> +
> +        Given a commit SHA-1 `get_cache_slice` will search the slice index and 
> +        return, if found, the cache-identifying SHA-1.
> +
> +`traverse_cache_slice`::
> +
> +        Traverse a specified cache slice based on:
> +
> +        * `rev_cache_info` instance (optional)
> +        * cache SHA-1 identifier
> +        * `rev_info` instance
> +        * a starting commit and commit work list
> +        * date of oldest encountered interesting commit
> +        * current `slop` (this and above mainly used in integration with 
> +          revision walker)
>   

Hmm what's a 'slop' ?

> +        
> ++
> +The output is sent to a FILO `commit_list` "queue", while any bottom commits 
> +are passed back into the work list.  If the walk is not contained within the 
> +slice, commit boundaries are also inserted into "work".
> +
> +`tops_from_slices`::
> +
> +        Will mark all top-commits in the specified cache slices with a given 
> +        flag, and add them to the rev pending list.  Will include all if no 
> +        slices are specified.
> +
> +`coagulate_cache_slices`::
> +
> +        Generate a slice based on the passed `rev_info` instance, replacing all 
> +        encountered slices with one (larger) slice.  The `ignore_size` field in 
> +        `rci`, if non-zero, will dictate which cache slice sizes to ignore in 
> +        both traversal and replacement.
> +
> +`regenerate_index`::
> +
> +        Remake cache index.
> +
> +Example Usage
> +-------------
> +
> +A few examples to demonstrate usage:
> +
> +.Creating a slice
> +----
> +/* pretend you're a porcelain for rev-cache reading from the command line */
> +struct rev_info revs;
> +struct rev_cache_info rci;
> +
> +init_revisions(&revs, 0);
> +init_rci(&rci);
> +
> +flags = 0;
> +for (i = 1; i < argc; i++) {
> +        if (!strcmp(argv[i], "--not"))
> +                flags ^= UNINTERESTING;
> +        else if(!strcmp(argv[i], "--fresh"))
> +                starts_from_slices(&revs, UNINTERESTING, 0, 0);
> +        else
> +                handle_revision_arg(argv[i], &revs, flags, 1);
> +}
> +
> +/* we want to explicitly set certain options */
> +rci.objects = 0;
> +
> +if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1))
> +        printf("made slice!  it's called %s\n", sha1_to_hex(cache_sha1));
>   

Heh, normally it's acceptable to let examples in documentation not be
complete working programs :)

> +----
> +
> +.Traversing a slice
> +----
> +/* let's say you're walking the tree with a 'work' list of current heads and a 
> + * FILO output list 'out' */
> +out = 0;
> +outp = &out;
> +
> +while (work) {
> +        struct commit *commit = pop_commit(&work);
> +        struct object *object = &commit->object;
> +        unsigned char *cache_sha1;
> +        
> +        if (cache_sha1 = get_cache_slice(object->sha1)) {
> +                /* note that this will instatiate any topo-relations 
> +                 * as it goes */
> +                if (traverse_cache_slice(&revs, cache_sha1, 
> +                        commit, 0, 0, /* use defaults */
> +                        &outp, &work) < 0)
> +                        die("I'm overreacting to a non-fatal cache error");
> +        } else {
> +                struct commit_list *parents = commit->parents;
> +                
> +                while (parents) {
> +                        struct commit *p = parents->item;
> +                        struct object *po = &p->object;
> +                        
> +                        parents = parents->next;
> +                        if (po->flags & UNINTERESTING)
> +                                continue;
> +                        
> +                        if (object->flags & UNINTERESTING)
> +                                po->flags |= UNINTERESTING;
> +                        else if (po->flags & SEEN)
> +                                continue;
> +                        
> +                        if (!po->parsed)
> +                                parse_commit(p);
> +                        insert_by_date(p, &work);
> +                }
> +                
> +                if (object->flags & (SEEN | UNINTERESTING) == 0)
> +                        outp = &commit_list_insert(commit, outp);
> +                object->flags |= SEEN;
> +        }
> +}
> +----
>   

Ok, good - perhaps needs a comment or two - not much, it's already long.

> +
> +Some Internals
> +--------------
> +
> +Although you really don't need to know anything about how rev-cache actually 
> +does its magic shizz, a bit of background may go a long way if you're wading 
> +through the source.
>   

"does its work" will do nicely..

> +
> +File Formats
> +~~~~~~~~~~~~
> +
> +A slice has a basic fixed-size header, followed by a certain number of object 
> +entries.  Commits are sorted in topo-order, and each commit entry is followed 
> +by the objects added in that commit.
>   

Starting from the top or the bottom?  And it might pay to clarify which
objects are included or not; ie objects reachable from "bottom" commits
or not.

> +
> +----
> +         -- +--------------------------------+
> +header      | object number, etc...          |
> +         -- +--------------------------------+
> +commit      | commit info                    |
> +entry       | path data                      |
> +            +--------------------------------+
> +            | tree/blob info                 |
> +            +--------------------------------+
> +            | tree/blob info                 |
> +            +--------------------------------+
> +            | ...                            |
> +         -- +--------------------------------+
> +commit      | commit info                    |
> +entry       | path data                      |
> +            +--------------------------------+
> +            | tree/blob info                 |
> +            +--------------------------------+
> +            | ...                            |
> +         -- +--------------------------------+
> +...         ...                               
> +            +--------------------------------+
> +----
> +
> +The index is somewhat similar to pack-file indexes, containing a fanout table 
> +and a list of index entries sorted by hash.
> +
> +----
> +         -- +--------------------------------+
> +header      | object #, cache #, etc.        |
> +         -- +--------------------------------+
> +cache       | SHA-1                          |
> +sha1s       | ...                            |
> +         -- +--------------------------------+
> +fanout      | fanout[0x00]                   |
> +table       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +            | fanout[0xff]                   |
> +         -- +--------------------------------+
> +index       | entry SHA-1                    |
> +entries     | cache sha1 index               |
> +            +--------------------------------+
> +            |                                |
> +            ...                               
> +            +--------------------------------+
> +----
> +
> +All the relavant structures are readily accessible in `rev-cache.c`
> +
>   

Ok right so which is the on-disk container format?  The index or the
slice?  I'm a little confused..

> +Mechanics
> +~~~~~~~~~
> +
> +The most important part of rev-cache is its method of encoding topological 
> +relations.  To ensure fluid traversal and reconstruction,

Is it still fluid after coagulation?  ;-)

>  commits are related 
> +through high-level "streams"/"channels" rather than individual 
> +interconnections.  Intuitively, rev-cache stores history the way gitk shows it: 
> +commits strung up on lines, which interconnect at merges and branches.
> +
> +Each commit is associated to a given channel/path via a 'path id', and 
> +variable-length fields govern which paths (if any) are closed or opened at that 
> +object.  This means that topo-data can be preserved in only a few bytes extra 
> +per object entry.  Other information stored per entry is the sha-1 hash, type, 
> +date, size, and status in cache slice.  Here is format of an object entry, both 
> +on-disk and in-memory:
> +
> +----
> +struct object_entry {
> +        unsigned type : 3;
> +        unsigned is_end : 1;
> +        unsigned is_start : 1;
> +        unsigned uninteresting : 1;
> +        unsigned include : 1;
> +        unsigned flags : 1;
> +        unsigned char sha1[20];
> +        
> +        unsigned merge_nr : 6;
> +        unsigned split_nr : 7;
> +        unsigned size_size : 3;
> +        
> +	uint32_t date;
> +	uint16_t path;
> +        
> +        /* merge paths */
> +        /* split paths */
> +        /* size */
> +};
> +----
> +
> +An explanation of each field:
> +
> +`type`:: Object type
> +`is_end`:: The commit has some parents outside the cache slice (all if slice 
> +has legs)
> +`is_start`:: The commit has no children in cache slice
> +`uninteresting`:: Run-time flag, used in traversal
> +`include`:: Run-time flag, used in traversal (initialization)
> +`flags`:: Currently unused, extra bit
> +`sha`:: Object SHA-1 hash
> +
> +`merge_nr`:: The number of paths the current channel diverges into; the current 
> +path ends upon any merge.
> +`split_nr`:: The number of paths this commit ends; used on both merging and 
> +branching.
> +`size_size`:: Number of bytes the object size takes up.
> +
> +merge paths:: The path IDs (16-bit) that are to be created.
> +split paths:: The path IDs (16-bit) that are to be ended.
> +size:: The size split into the minimum number of bytes.
> +
> +Each path ID refers to an index in a 'path array', which stores the current 
> +status (eg. active, interestingness) of each channel.
> +
> +Due to topo-relations and boundary tracking, all of a commit's parents must be 
> +encountered before the path is reallocated.  This is achieved by using a 
> +counter system per merge: starting at the parent number, the counter is 
> +decremented as each parent is encountered (dictated by 'split paths'); at 0 the 
> +path is cleared.
> +
> +Boundary tracking is necessary because non-commits are stored relative to the 
> +commit in which they were introduced.  If a series of commits is not included 
> +in the output, the last interesting commit must be parsed manually to ensure 
> +all objects are accounted for.
> +
> +To prevent list-objects from recursing into trees that we've already taken care 
> +of, the flag `FACE_VALUE` is introduced.  An object with this flag is not 
> +explored (= "taken at face value"), significantly reducing I/O and processing 
> +time.
> +
> +(NSE)
>   

tl;dr but that's ok it's magic shizz.

Anyway looking nice ... see what I can say about the next patches.

sam

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/5] revision caching documentation: man page and technical discussion
  2009-08-07  3:08 ` Sam Vilain
@ 2009-08-07 12:07   ` Rogan Dawes
  2009-08-07 12:20     ` Johannes Schindelin
  0 siblings, 1 reply; 5+ messages in thread
From: Rogan Dawes @ 2009-08-07 12:07 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nick Edelen, Junio C Hamano, Johannes Schindelin, Jeff King,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Sam Vilain wrote:

>> +`fuse`::
>> +	Coagulate several cache slices into a single large slice.
>>   
> 
> Coagulate?  You mean, the revision caches will stop being liquid and go
> gluggy, like a pool of blood clotting?
> 
> How about "combine" :-) - and the option might be better called
> something simple like that, too.

I think the word he had in mind was "coalesce".

Rogan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/5] revision caching documentation: man page and technical discussion
  2009-08-07 12:07   ` Rogan Dawes
@ 2009-08-07 12:20     ` Johannes Schindelin
  2009-08-07 21:58       ` Sam Vilain
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Schindelin @ 2009-08-07 12:20 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Sam Vilain, Nick Edelen, Junio C Hamano, Jeff King,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

Hi,

On Fri, 7 Aug 2009, Rogan Dawes wrote:

> Sam Vilain wrote:
> 
> >> +`fuse`::
> >> +	Coagulate several cache slices into a single large slice.
> >>   
> > 
> > Coagulate?  You mean, the revision caches will stop being liquid and go
> > gluggy, like a pool of blood clotting?
> > 
> > How about "combine" :-) - and the option might be better called
> > something simple like that, too.
> 
> I think the word he had in mind was "coalesce".

As Git users typically have a quite good idea what a "merge" is, I'd 
prefer that word anyway.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/5] revision caching documentation: man page and technical discussion
  2009-08-07 12:20     ` Johannes Schindelin
@ 2009-08-07 21:58       ` Sam Vilain
  0 siblings, 0 replies; 5+ messages in thread
From: Sam Vilain @ 2009-08-07 21:58 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Rogan Dawes, Nick Edelen, Junio C Hamano, Jeff King,
	Shawn O. Pearce, Andreas Ericsson, Christian Couder,
	git@vger.kernel.org

On Fri, 2009-08-07 at 14:20 +0200, Johannes Schindelin wrote:
> > I think the word he had in mind was "coalesce".
> As Git users typically have a quite good idea what a "merge" is, I'd 
> prefer that word anyway.

I don't like that so much, given that "merge" has quite a specific
meaning.  But not really that fussed.  In this case it's more like a
'gc' or 'repack'; an internal reshuffle of an index to make it a single
one and not many.

Sam

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2009-08-07 21:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-06  9:55 [PATCH 1/5] revision caching documentation: man page and technical discussion Nick Edelen
2009-08-07  3:08 ` Sam Vilain
2009-08-07 12:07   ` Rogan Dawes
2009-08-07 12:20     ` Johannes Schindelin
2009-08-07 21:58       ` Sam Vilain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).