Re: Index format v5 - Michael Haggerty

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Wed, 09 May 2012 10:37:03 +0200	[thread overview]
Message-ID: <4FAA2CAF.3040408@alum.mit.edu> (raw)
In-Reply-To: <20120508141137.GA3937@tgummerer.surfnet.iacbox>

On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
>> * The details of the extension data blocks are described in the first
>>    (overview) section, whereas it seems like they should be described
>>    in their own section following the "conflict data" section.  But
>>    wouldn't the presence of extension data blocks prevent the addition
>>    of conflict data?
>
> Only the details that should be there for every extension are described
> in the overview (the header of the extension), to make sure every
> extension has the same header format, and thus a reader which doesn't
> understand a specific extension still can read its header and know
> what's going on.
>
> They won't prevent the addition of conflicted data, since when a
> conflict is created, other files were probably added and the index has
> to be rewritten anyway. Once the conflict is resolved however only a
> bit has to be flipped, so there is no rewrite necessary.

In other words, the presence of extensions *does indeed* prevent the 
addition of conflict data, but you don't think that it is a problem.

Moving the conflict data to after the extensions, on the other hand, 
would mean that conflict data can sometimes be added without a rewrite. 
  I cannot judge whether this would be useful.

Handling conflict data *as* an extension would allow the conflict data 
to be added at any time without rewriting.  I cannot judge whether this 
would be useful.

>> * Does the index file include directory entries for empty directories?
>>    What about directories that contain only other directories?
>
> In theory the index is able to include empty directories. I'm however
> not sure if this should be implemented. I'd be happy to get more
> feedback there.

Currently git does not keep track of empty directories.  Even though 
there have been proposals to fix this, it is far beyond the scope of 
your project to implement the handling of empty directories.  The 
question is whether your format definition *forbids* the presence of 
empty directories in the index file (in the interest of definiteness, 
and it might make the reader implementation a little bit simpler, but it 
imposes a constraint on the writer).  Obviously empty directories, even 
if present, mustn't have an effect on the SHA1 of the trees containing them.

>> Directory entry
>> ===============
>>
>> * "4-byte number of entries in the index that is covered by the tree
>>    this entry represents."  What does this include?
>>    Files/directories/both?  Recursive or non-recursive?
>
> This is from the cache-tree. I'm not sure but I think it includes both
> files and directories, recursively.

Please figure this out for the final spec.

>> File entry
>> ==========
>> [...]
>
>> * Are file entries sorted by entire path or only by the basename?
>
> They are sorted by the basename, in the respective block of their
> directories.
> Example: paths: a/a a/z b/b
> File entries in the index:
> a ...
> z ...
> b ...

OK, so in other words, the file entries of all files in a directory (not 
including files in subdirectories) are stored contiguously, sorted by 
basename.  (The thing that wasn't immediately clear is whether files 
from subdirectories are intermingled with those of the parent directory.)

>> Flat loading
>> ============
>>
>> * I found the explanation pretty incomprehensible.  Perhaps some
>>    pseudo-code would make it clearer?
>> [...]
> [...] I have changed the flat loading in the documentation,
> hope it's more understandable now.

Maybe it's just be, but I still don't think it is very clear.  Here is 
version fbf8add1b026:

> == Flat loading
>
> Since internally git expects and works with lexicografic ordering,
> a simple linear scan throught the subdirectories doesn't give
> the right internal sorting. To achieve the right internal sorting
> the loading will be done in the following way:
>
> 1. Start with the root directory, and read also the name of the
>   first subdirectory (=next directory in the list).
>
> 1a. Use the next directory (the one against which the filenames
>   were checked previously), and read the next directory name,
>   to check the files against.
>
> 2. Check the stack if the element at the top is < then the current
>   directoryname.
>
>   If it's < then current directory name, add files from the stack
>     to the entry list, until the file name is > then the
>     directory name.
>
> 2. While filename < directoryname add the filenames to the entry
>   list
>
> 3. Add the rest of the files to a stack.
>
> 4. Continue with 1a, if there are more directories left.
>
> 5. Add the rest of the files from the stack to the end of the
>   entry list.

Aside from the fact that there are two number (2)s,

* What does "Use the next directory (the one against which the filenames 
were checked previously)" mean?  What does it mean to "use a directory"? 
  Does it mean to recurse into the directory?  Is the stack preserved 
passed down to the recursive function calls, or does each level of the 
recursion have its own stack?  What does "against which the filenames 
were checked previously" mean (there are no filenames mentioned in the 
earlier steps)?

* You talk about a stack, and "Add the rest of the files to a stack". 
But when you retrieve entries from a stack, they come out in reverse 
order.  So are you imagining that each element of the stack is an array 
of file entries?  Or do you push the files onto the stack in reverse 
order?  Or do you really mean a queue rather than a stack?

* Are the file entries read before they are put on the stack, or does 
the stack just remember where to read them from when their turn comes?

* "Continue with 1a, if there are more directories left": I assume you 
mean subdirectories of the current directory, but maybe you are talking 
about all directories?

There is a reason that I asked for pseudocode, namely because it forces 
you to be more precise in your description.  I can certainly imagine 
several workable algorithms for reading the index file, and the 
different algorithms have different tradeoffs particularly regarding the 
amount of temporary space needed and locality of reference in the index 
file (which, I understand, will be mmapped when practical but it is not 
practical on all platforms).  Once you express the algorithm in 
pseudocode it is possible to be sure which variant you have chosen and 
consider whether it is really workable.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

next prev parent reply	other threads:[~2012-05-09  8:44 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03   ` Junio C Hamano
2012-05-04  7:12   ` Michael Haggerty
2012-05-07 22:18     ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36   ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11   ` Thomas Rast
2012-05-03 19:31   ` Thomas Rast
2012-05-03 19:32     ` Thomas Rast
2012-05-03 20:32       ` Junio C Hamano
2012-05-03 21:38   ` Thomas Gummerer
2012-05-07 18:57     ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44   ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46   ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44   ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08   ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11   ` Thomas Gummerer
2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
2012-05-10  6:53         ` Thomas Gummerer
2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
2012-05-09  8:37     ` Michael Haggerty [this message]
2012-05-10 12:19       ` Thomas Gummerer
2012-05-10 18:17         ` Michael Haggerty
2012-05-11 17:12           ` Thomas Gummerer
2012-05-13 19:50             ` Michael Haggerty
2012-05-14 15:01               ` Thomas Gummerer
2012-05-14 21:08                 ` Michael Haggerty
2012-05-14 22:10                   ` Thomas Rast
2012-05-15  6:43                     ` Michael Haggerty
2012-05-15 13:49                   ` Thomas Gummerer
2012-05-15 15:02                     ` Michael Haggerty
2012-05-18 15:38                       ` Thomas Gummerer
2012-05-19 13:00                         ` Michael Haggerty
2012-05-21  7:45                           ` Thomas Gummerer
2012-05-16  5:01                     ` Michael Haggerty
2012-05-16 21:54                       ` Thomas Gummerer
2012-05-19  5:40                         ` Michael Haggerty
2012-05-21 20:30                           ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54   ` Thomas Gummerer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FAA2CAF.3040408@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=davidbarr@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    --cc=t.gummerer@gmail.com \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).