Re: Index format v5 - Thomas Gummerer

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Thomas Gummerer <t.gummerer@gmail.com>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Fri, 11 May 2012 19:12:30 +0200	[thread overview]
Message-ID: <20120511171230.GA2107@tgummerer> (raw)
In-Reply-To: <4FAC0633.90809@alum.mit.edu>

On 05/10, Michael Haggerty wrote:
> On 05/10/2012 02:19 PM, Thomas Gummerer wrote:
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >The data structure is a stack of queues, to allow continous reading
> >of the file.
> >
> >s ->  queue1
> >t ->  queue2
> >a ->  queue3
> >c ->  queue4
> >k ->  queue5
> >
> >dirs = read_all_directories
> >
> >foreach dir in dirs do
> >     file = read_next_file
> >
> >     while element_on_top_of_stack.first_element<  nextdir
> >         indexentries.append(dequeue(element_on_top_of_stack))
> >         if element_on_top_of_stack == emtpy:
> >             remove_element_on_top_of_stack
> >
> >     if file[filename]<  nextdir
> >         indexentries.append(file)
> >     else
> >         queue.add(file)
> >         foreach f in rest_of_files_in_directory:
> >             queue.add(f)
> >         stack.push(queue)
> >
> >foreach queue in stack:
> >     foreach entry in queue:
> >         indexentry.append(entry)
> 
> 1. It seems to me that the first time that a file with a filename
> before nextdir is encountered, the reading of the directory
> containing the file will be terminated.  This would, for example, be
> the case for any directory that contains multiple files but no
> subdirectories.
> 
> 2. There is still a lot that is unnecessarily obscure.  For example,
> I suppose (but you don't say) that "rest_of_files_in_directory"
> means to read the files at that moment.  It would be more explicit
> (and no more verbose) to write
> 
>     while (f = read_next_file()) != NULL:
>         queue.add(f)
> 
> 3. You don't cover corner cases, like when read_next_file() is
> called but there are no files left in the directory, or when there
> is no nextdir (which itself is not defined).  OK, this pseudocode is
> only meant to be illustrative, so I guess we can wait until your
> real implementation to see such details.  On the other hand, you
> probably want to get all the details straight in pseudocode or
> Python before you start translating it into C.
> 
> 4. I think the algorithm would be easier to understand and implement
> if it were written recursively.  The call stack would replace your
> explicit stack (but you would still need one queue per directory
> level).  A key observation is that when "nextdir" precedes the next
> file, then all of the files in subdirectories of nextdir do as well.
> Thus an obvious recursion would be to call a function like
> read_all_files_under_dir(indexentries, dirs, dirindex) at this
> point, which would consume all of the directories that are
> subdirectories of dirs[dirindex] (i.e., all directories whose names
> have the string dirs[dirindex] as a prefix).  Using this method
> would mean that there is no need to compare each file under
> dirs[dirindex] against the next file in the outer directory.
> 
> Michael

Thanks for your feedback! To get clearer code I've now written a
working reader for the v5 index format in Python. The full reader
would probably be to long for the mailing list, but here is the
interesting part:


def readfiles(directories, dirnr, entries):
    global filedata
    f.seek(directories[dirnr]["foffset"])
    offset = struct.unpack("!I", fread(4))[0]
    f.seek(offset)
    filedata = list()
    queue = list()
    i = 0
    while i < directories[dirnr]["nfiles"]:
        filedata.append(struct.pack("!I", f.tell()))
        filename = ""
        byte = fread(1)
        while byte != '\0':
            filename += byte
            byte = fread(1)

        data = struct.unpack("!HHIII", fread(16))
        objhash = fread(20)
        readcrc = struct.pack("!i", binascii.crc32("".join(filedata)))
        crc = f.read(4)
        if readcrc != crc:
            print "Wrong CRC: " + filename
        filedata = list()

        i += 1

        queue.append(dict({"name": directories[dirnr]["pathname"] + filename, "flags": data[0], "mode": data[1], "mtimes": data[2], "mtimens": data[3], "statcrc": data[4], "objhash": binascii.hexlify(objhash)}))

    if len(directories) > dirnr:
        i = 0
        while i < len(queue):
            if len(directories) - 1 > dirnr and queue[i]["name"] > directories[dirnr + 1]["pathname"]:
                entries, dirnr = readfiles(directories, dirnr + 1, entries)
            else:
                entries.append(queue[i])
                i += 1
        return entries, dirnr

The full reader can be found here:
https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py

-- 
Thomas

next prev parent reply	other threads:[~2012-05-11 17:12 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03   ` Junio C Hamano
2012-05-04  7:12   ` Michael Haggerty
2012-05-07 22:18     ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36   ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11   ` Thomas Rast
2012-05-03 19:31   ` Thomas Rast
2012-05-03 19:32     ` Thomas Rast
2012-05-03 20:32       ` Junio C Hamano
2012-05-03 21:38   ` Thomas Gummerer
2012-05-07 18:57     ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44   ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46   ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44   ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08   ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11   ` Thomas Gummerer
2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
2012-05-10  6:53         ` Thomas Gummerer
2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
2012-05-09  8:37     ` Michael Haggerty
2012-05-10 12:19       ` Thomas Gummerer
2012-05-10 18:17         ` Michael Haggerty
2012-05-11 17:12           ` Thomas Gummerer [this message]
2012-05-13 19:50             ` Michael Haggerty
2012-05-14 15:01               ` Thomas Gummerer
2012-05-14 21:08                 ` Michael Haggerty
2012-05-14 22:10                   ` Thomas Rast
2012-05-15  6:43                     ` Michael Haggerty
2012-05-15 13:49                   ` Thomas Gummerer
2012-05-15 15:02                     ` Michael Haggerty
2012-05-18 15:38                       ` Thomas Gummerer
2012-05-19 13:00                         ` Michael Haggerty
2012-05-21  7:45                           ` Thomas Gummerer
2012-05-16  5:01                     ` Michael Haggerty
2012-05-16 21:54                       ` Thomas Gummerer
2012-05-19  5:40                         ` Michael Haggerty
2012-05-21 20:30                           ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54   ` Thomas Gummerer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120511171230.GA2107@tgummerer \
    --to=t.gummerer@gmail.com \
    --cc=davidbarr@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).