From: Thomas Gummerer <t.gummerer@gmail.com>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Thu, 10 May 2012 14:19:11 +0200 [thread overview]
Message-ID: <20120510121911.GB98491@tgummerer> (raw)
In-Reply-To: <4FAA2CAF.3040408@alum.mit.edu>
On 05/09, Michael Haggerty wrote:
> On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
> >>* The details of the extension data blocks are described in the first
> >> (overview) section, whereas it seems like they should be described
> >> in their own section following the "conflict data" section. But
> >> wouldn't the presence of extension data blocks prevent the addition
> >> of conflict data?
> >
> >Only the details that should be there for every extension are described
> >in the overview (the header of the extension), to make sure every
> >extension has the same header format, and thus a reader which doesn't
> >understand a specific extension still can read its header and know
> >what's going on.
> >
> >They won't prevent the addition of conflicted data, since when a
> >conflict is created, other files were probably added and the index has
> >to be rewritten anyway. Once the conflict is resolved however only a
> >bit has to be flipped, so there is no rewrite necessary.
>
> In other words, the presence of extensions *does indeed* prevent the
> addition of conflict data, but you don't think that it is a problem.
Exactly.
> Moving the conflict data to after the extensions, on the other hand,
> would mean that conflict data can sometimes be added without a
> rewrite. I cannot judge whether this would be useful.
>
> Handling conflict data *as* an extension would allow the conflict
> data to be added at any time without rewriting. I cannot judge
> whether this would be useful.
Since there are offsets in the directory data to the conflicted data
I don't think it's good to call this data extension data. It may
however be beneficial to have the conflict data after the extension.
I'll investigate this.
> >>* Does the index file include directory entries for empty directories?
> >> What about directories that contain only other directories?
> >
> >In theory the index is able to include empty directories. I'm however
> >not sure if this should be implemented. I'd be happy to get more
> >feedback there.
>
> Currently git does not keep track of empty directories. Even though
> there have been proposals to fix this, it is far beyond the scope of
> your project to implement the handling of empty directories. The
> question is whether your format definition *forbids* the presence of
> empty directories in the index file (in the interest of
> definiteness, and it might make the reader implementation a little
> bit simpler, but it imposes a constraint on the writer). Obviously
> empty directories, even if present, mustn't have an effect on the
> SHA1 of the trees containing them.
No, the index format doesn't forbid the presence of empty directories.
Empty directories will have a fileoffset of 0, and the reader will
just ignore them as long as there is no empty directory tracking.
> >>Directory entry
> >>===============
> >>
> >>* "4-byte number of entries in the index that is covered by the tree
> >> this entry represents." What does this include?
> >> Files/directories/both? Recursive or non-recursive?
> >
> >This is from the cache-tree. I'm not sure but I think it includes both
> >files and directories, recursively.
>
> Please figure this out for the final spec.
It includes only files, in a recursive manner. I've written this down
in the spec.
> >>File entry
> >>==========
> >>[...]
> >
> >>* Are file entries sorted by entire path or only by the basename?
> >
> >They are sorted by the basename, in the respective block of their
> >directories.
> >Example: paths: a/a a/z b/b
> >File entries in the index:
> >a ...
> >z ...
> >b ...
>
> OK, so in other words, the file entries of all files in a directory
> (not including files in subdirectories) are stored contiguously,
> sorted by basename. (The thing that wasn't immediately clear is
> whether files from subdirectories are intermingled with those of the
> parent directory.)
Yes, exactly.
> >>Flat loading
> >>============
> >>
> >>* I found the explanation pretty incomprehensible. Perhaps some
> >> pseudo-code would make it clearer?
> >>[...]
> >[...] I have changed the flat loading in the documentation,
> >hope it's more understandable now.
>
> Maybe it's just be, but I still don't think it is very clear. Here
> is version fbf8add1b026:
>
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >1. Start with the root directory, and read also the name of the
> > first subdirectory (=next directory in the list).
> >
> >1a. Use the next directory (the one against which the filenames
> > were checked previously), and read the next directory name,
> > to check the files against.
> >
> >2. Check the stack if the element at the top is < then the current
> > directoryname.
> >
> > If it's < then current directory name, add files from the stack
> > to the entry list, until the file name is > then the
> > directory name.
> >
> >2. While filename < directoryname add the filenames to the entry
> > list
> >
> >3. Add the rest of the files to a stack.
> >
> >4. Continue with 1a, if there are more directories left.
> >
> >5. Add the rest of the files from the stack to the end of the
> > entry list.
>
> [..]
> There is a reason that I asked for pseudocode, namely because it
> forces you to be more precise in your description. I can certainly
> imagine several workable algorithms for reading the index file, and
> the different algorithms have different tradeoffs particularly
> regarding the amount of temporary space needed and locality of
> reference in the index file (which, I understand, will be mmapped
> when practical but it is not practical on all platforms). Once you
> express the algorithm in pseudocode it is possible to be sure which
> variant you have chosen and consider whether it is really workable.
Ok, here is the variant in pseudo code. I hope it's understandable
this way. It needs some temporary space, but never more then the
actual entries will need in the end anyway.
== Flat loading
Since internally git expects and works with lexicografic ordering,
a simple linear scan throught the subdirectories doesn't give
the right internal sorting. To achieve the right internal sorting
the loading will be done in the following way:
The data structure is a stack of queues, to allow continous reading
of the file.
s -> queue1
t -> queue2
a -> queue3
c -> queue4
k -> queue5
dirs = read_all_directories
foreach dir in dirs do
file = read_next_file
while element_on_top_of_stack.first_element < nextdir
indexentries.append(dequeue(element_on_top_of_stack))
if element_on_top_of_stack == emtpy:
remove_element_on_top_of_stack
if file[filename] < nextdir
indexentries.append(file)
else
queue.add(file)
foreach f in rest_of_files_in_directory:
queue.add(f)
stack.push(queue)
foreach queue in stack:
foreach entry in queue:
indexentry.append(entry)
next prev parent reply other threads:[~2012-05-10 12:19 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03 ` Junio C Hamano
2012-05-04 7:12 ` Michael Haggerty
2012-05-07 22:18 ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36 ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
2012-05-03 19:31 ` Thomas Rast
2012-05-03 19:32 ` Thomas Rast
2012-05-03 20:32 ` Junio C Hamano
2012-05-03 21:38 ` Thomas Gummerer
2012-05-07 18:57 ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44 ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46 ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44 ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08 ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11 ` Thomas Gummerer
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
2012-05-10 6:53 ` Thomas Gummerer
2012-05-10 11:06 ` Nguyen Thai Ngoc Duy
2012-05-09 8:37 ` Michael Haggerty
2012-05-10 12:19 ` Thomas Gummerer [this message]
2012-05-10 18:17 ` Michael Haggerty
2012-05-11 17:12 ` Thomas Gummerer
2012-05-13 19:50 ` Michael Haggerty
2012-05-14 15:01 ` Thomas Gummerer
2012-05-14 21:08 ` Michael Haggerty
2012-05-14 22:10 ` Thomas Rast
2012-05-15 6:43 ` Michael Haggerty
2012-05-15 13:49 ` Thomas Gummerer
2012-05-15 15:02 ` Michael Haggerty
2012-05-18 15:38 ` Thomas Gummerer
2012-05-19 13:00 ` Michael Haggerty
2012-05-21 7:45 ` Thomas Gummerer
2012-05-16 5:01 ` Michael Haggerty
2012-05-16 21:54 ` Thomas Gummerer
2012-05-19 5:40 ` Michael Haggerty
2012-05-21 20:30 ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54 ` Thomas Gummerer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120510121911.GB98491@tgummerer \
--to=t.gummerer@gmail.com \
--cc=davidbarr@google.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=mhagger@alum.mit.edu \
--cc=peff@peff.net \
--cc=spearce@spearce.org \
--cc=trast@student.ethz.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).