All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gummerer <t.gummerer@gmail.com>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Fri, 11 May 2012 19:12:30 +0200	[thread overview]
Message-ID: <20120511171230.GA2107@tgummerer> (raw)
In-Reply-To: <4FAC0633.90809@alum.mit.edu>

On 05/10, Michael Haggerty wrote:
> On 05/10/2012 02:19 PM, Thomas Gummerer wrote:
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >The data structure is a stack of queues, to allow continous reading
> >of the file.
> >
> >s ->  queue1
> >t ->  queue2
> >a ->  queue3
> >c ->  queue4
> >k ->  queue5
> >
> >dirs = read_all_directories
> >
> >foreach dir in dirs do
> >     file = read_next_file
> >
> >     while element_on_top_of_stack.first_element<  nextdir
> >         indexentries.append(dequeue(element_on_top_of_stack))
> >         if element_on_top_of_stack == emtpy:
> >             remove_element_on_top_of_stack
> >
> >     if file[filename]<  nextdir
> >         indexentries.append(file)
> >     else
> >         queue.add(file)
> >         foreach f in rest_of_files_in_directory:
> >             queue.add(f)
> >         stack.push(queue)
> >
> >foreach queue in stack:
> >     foreach entry in queue:
> >         indexentry.append(entry)
> 
> 1. It seems to me that the first time that a file with a filename
> before nextdir is encountered, the reading of the directory
> containing the file will be terminated.  This would, for example, be
> the case for any directory that contains multiple files but no
> subdirectories.
> 
> 2. There is still a lot that is unnecessarily obscure.  For example,
> I suppose (but you don't say) that "rest_of_files_in_directory"
> means to read the files at that moment.  It would be more explicit
> (and no more verbose) to write
> 
>     while (f = read_next_file()) != NULL:
>         queue.add(f)
> 
> 3. You don't cover corner cases, like when read_next_file() is
> called but there are no files left in the directory, or when there
> is no nextdir (which itself is not defined).  OK, this pseudocode is
> only meant to be illustrative, so I guess we can wait until your
> real implementation to see such details.  On the other hand, you
> probably want to get all the details straight in pseudocode or
> Python before you start translating it into C.
> 
> 4. I think the algorithm would be easier to understand and implement
> if it were written recursively.  The call stack would replace your
> explicit stack (but you would still need one queue per directory
> level).  A key observation is that when "nextdir" precedes the next
> file, then all of the files in subdirectories of nextdir do as well.
> Thus an obvious recursion would be to call a function like
> read_all_files_under_dir(indexentries, dirs, dirindex) at this
> point, which would consume all of the directories that are
> subdirectories of dirs[dirindex] (i.e., all directories whose names
> have the string dirs[dirindex] as a prefix).  Using this method
> would mean that there is no need to compare each file under
> dirs[dirindex] against the next file in the outer directory.
> 
> Michael

Thanks for your feedback! To get clearer code I've now written a
working reader for the v5 index format in Python. The full reader
would probably be to long for the mailing list, but here is the
interesting part:


def readfiles(directories, dirnr, entries):
    global filedata
    f.seek(directories[dirnr]["foffset"])
    offset = struct.unpack("!I", fread(4))[0]
    f.seek(offset)
    filedata = list()
    queue = list()
    i = 0
    while i < directories[dirnr]["nfiles"]:
        filedata.append(struct.pack("!I", f.tell()))
        filename = ""
        byte = fread(1)
        while byte != '\0':
            filename += byte
            byte = fread(1)

        data = struct.unpack("!HHIII", fread(16))
        objhash = fread(20)
        readcrc = struct.pack("!i", binascii.crc32("".join(filedata)))
        crc = f.read(4)
        if readcrc != crc:
            print "Wrong CRC: " + filename
        filedata = list()

        i += 1

        queue.append(dict({"name": directories[dirnr]["pathname"] + filename, "flags": data[0], "mode": data[1], "mtimes": data[2], "mtimens": data[3], "statcrc": data[4], "objhash": binascii.hexlify(objhash)}))

    if len(directories) > dirnr:
        i = 0
        while i < len(queue):
            if len(directories) - 1 > dirnr and queue[i]["name"] > directories[dirnr + 1]["pathname"]:
                entries, dirnr = readfiles(directories, dirnr + 1, entries)
            else:
                entries.append(queue[i])
                i += 1
        return entries, dirnr

The full reader can be found here:
https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py

-- 
Thomas

  reply	other threads:[~2012-05-11 17:12 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03   ` Junio C Hamano
2012-05-04  7:12   ` Michael Haggerty
2012-05-07 22:18     ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36   ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11   ` Thomas Rast
2012-05-03 19:31   ` Thomas Rast
2012-05-03 19:32     ` Thomas Rast
2012-05-03 20:32       ` Junio C Hamano
2012-05-03 21:38   ` Thomas Gummerer
2012-05-07 18:57     ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44   ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46   ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44   ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08   ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11   ` Thomas Gummerer
2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
2012-05-10  6:53         ` Thomas Gummerer
2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
2012-05-09  8:37     ` Michael Haggerty
2012-05-10 12:19       ` Thomas Gummerer
2012-05-10 18:17         ` Michael Haggerty
2012-05-11 17:12           ` Thomas Gummerer [this message]
2012-05-13 19:50             ` Michael Haggerty
2012-05-14 15:01               ` Thomas Gummerer
2012-05-14 21:08                 ` Michael Haggerty
2012-05-14 22:10                   ` Thomas Rast
2012-05-15  6:43                     ` Michael Haggerty
2012-05-15 13:49                   ` Thomas Gummerer
2012-05-15 15:02                     ` Michael Haggerty
2012-05-18 15:38                       ` Thomas Gummerer
2012-05-19 13:00                         ` Michael Haggerty
2012-05-21  7:45                           ` Thomas Gummerer
2012-05-16  5:01                     ` Michael Haggerty
2012-05-16 21:54                       ` Thomas Gummerer
2012-05-19  5:40                         ` Michael Haggerty
2012-05-21 20:30                           ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54   ` Thomas Gummerer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120511171230.GA2107@tgummerer \
    --to=t.gummerer@gmail.com \
    --cc=davidbarr@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.