Re: Index format v5 - Thomas Gummerer

From: Thomas Gummerer <t.gummerer@gmail.com>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Mon, 21 May 2012 09:45:25 +0200
Message-ID: <20120521074525.GA1054@tgummerer>
In-Reply-To: <4FB7998B.2030305@alum.mit.edu>


Thanks a lot for your feedback.

On 05/19, Michael Haggerty wrote:
> I've looked over the writing side of git-convert-index.py version
> 81411fe6c98, and here are my first comments:
> 
> * Please remove trailing whitespace from the source code.
> 
> * I suggest that you move constants and code shared by
>   git-convert-index.py and git-read-index-v5.py into a library.  Though
>   actually, given that git doesn't seem to have infrastructure for
>   dealing with Python libraries, this might take some improvisation.

I've created a directory python/lib, where I'll put the Python libraries.
I'm not entirely sure this is the correct way to do it, but since the
Python code is just a prototype and will not go into git proper, I think
it's fine.

For now I moved the format strings, the structs, the exceptions and
the new calculate_crc method to the library. Once I go over
git-read-index-v5.py I'll probably move more code there.
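
A minimal sketch of what such a shared module could look like. The
module name, the constant names and the header layout are made up for
illustration and are not taken from the actual prototype; only the
calculate_crc name comes from the description above.

```python
# python/lib/indexlib.py (hypothetical file name and contents)
import binascii
import struct

# Named struct formats; this layout is illustrative only.
HEADER_FORMAT = struct.Struct("!4sII")  # signature, version, entry count
CRC_FORMAT = struct.Struct("!I")        # unsigned 32-bit checksum


class IndexFormatError(Exception):
    """Raised when data does not match the expected layout."""


def calculate_crc(data, partial=0):
    """Return the CRC-32 of data as an unsigned 32-bit integer.

    Masking with 0xffffffff keeps the result the same across Python
    versions, where binascii.crc32() may return a signed value.
    """
    return binascii.crc32(data, partial) & 0xffffffff


def pack_header(signature, version, nentries):
    """Pack a header and append its checksum."""
    header = HEADER_FORMAT.pack(signature, version, nentries)
    return header + CRC_FORMAT.pack(calculate_crc(header))
```

Both scripts could then import the formats and calculate_crc from one
place instead of repeating the literal format strings.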

> * Please use constants for all of the struct formats.  Constants have
>   names, making them mostly self-documenting.
> 
> * write_directories() currently writes pathnames and fake data and
>   stores file offsets in memory.  Later write_directory_data() runs
>   through the file again, seek()ing over the filenames and filling in
>   real data.
> 
>   Wouldn't it be easier for the first pass just to *compute* and
>   record the offsets of the entries to RAM, without writing anything
>   to disk, and leave all of the writing to the second pass?

I don't think that would be easier, since I have to go over all the
data when writing anyway. It might, however, be faster.
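
For reference, a rough sketch of the compute-first variant, assuming a
simplified layout (a table of NUL-terminated pathnames, each followed by
a 32-bit offset, then the data blocks). The layout and the function
names are assumptions for illustration, not the real v5 format:

```python
import io
import struct

OFFSET_FORMAT = struct.Struct("!I")  # hypothetical: one 32-bit offset per entry


def layout(dirdata):
    """First pass: compute each data block's offset without writing."""
    names = sorted(dirdata)
    # The name table holds pathname, NUL and one offset slot per entry.
    position = sum(len(name) + 1 + OFFSET_FORMAT.size for name in names)
    offsets = {}
    for name in names:
        offsets[name] = position
        position += len(dirdata[name])
    return offsets


def write_index(fileobj, dirdata):
    """Second pass: write everything sequentially, with no seek()."""
    offsets = layout(dirdata)
    for name in sorted(dirdata):
        fileobj.write(name + b"\0" + OFFSET_FORMAT.pack(offsets[name]))
    for name in sorted(dirdata):
        fileobj.write(dirdata[name])
    return offsets


buf = io.BytesIO()
offsets = write_index(buf, {b"a/": b"XX", b"b/": b"YYY"})
```

The data is still traversed twice, but the file is only written once
and never revisited, which is where any speedup would come from.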

> * Instead of writing blank data, it is possible to seek() past it and
>   start writing the next thing.  The skipped-over file contents are
>   logically initialized to zero.
> 
> * When working with iteritems(), it is clearer to unpack the item
>   pairs and give them names rather than working with d[0] and d[1];
>   for example,
> 
>     -    for d in sorted(dirdata.iteritems()):
>     +    for (pathname,entry) in sorted(dirdata.iteritems()):
> 
> * write_directories() returns a "dirdata" that is just an empty
>   defaultdict.  This seems pointless.  Do you have future plans to
>   change write_directories() to store something into the dictionary?
> 
> * The documentation for binascii.crc32() mentions that it gives
>   inconsistent results (signed vs. unsigned) for different versions of
>   Python.  Please ensure that you are using it in a way that is
>   maximally portable.  (That seems to imply using (binascii.crc32(...)
>   & 0xffffffff) and treating the result as unsigned.)
> 
> * At first I thought it was a little bit odd that you pass data
>   structures around as dictionaries, but I didn't object.  But as I
>   look at more and more code it seems more and more cumbersome.
>   Therefore, I suggest that you define classes to hold the various
>   entities that are manipulated by your programs, because:
> 
>   * A class definition is a good place to document exactly what fields
>     an object is expected to have, and what they mean.
> 
>   * Access of instance fields (entry.path) is easier to read and type
>     than dictionary access (entry["path"]).
> 
>   * The class definitions will translate pretty directly to C structs.
> 
>   The fact that class instances use a bit more memory than
>   dictionaries is, I think, unimportant.  But if that really bothers
>   you, you can use __slots__ to save some of the instance memory.
> 
> At a higher level:
> 
> * What if the offsets to each section were stored in the header, and
>   the offsets recorded for dirs and files were relative to the start
>   of the section (rather than relative to the start of the file)?  I
>   think that this would leave open the possibility of formatting the
>   sections in memory in parallel in a single pass, then dumping the
>   sections to disk in a few big writes (though I'm not saying that this
>   should be the *default* way of writing).

I've been trying to think of drawbacks to doing it that way, but so far
none have come to mind. It sounds like a good idea, and I'll include it
in the next version.
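
A small sketch of how that could work, assuming just two sections and a
header that stores only the start of each. The header layout and all
names here are hypothetical:

```python
import io
import struct

# Hypothetical header: absolute start of the dir and file sections.
SECTION_HEADER = struct.Struct("!II")


def build_section(entries):
    """Format one section in memory, recording section-relative offsets."""
    section = io.BytesIO()
    offsets = {}
    for name, payload in sorted(entries.items()):
        offsets[name] = section.tell()
        section.write(payload)
    return section.getvalue(), offsets


def write_index(fileobj, dir_bytes, file_bytes):
    """Write the header and both sections in three large writes."""
    dir_start = SECTION_HEADER.size
    file_start = dir_start + len(dir_bytes)
    fileobj.write(SECTION_HEADER.pack(dir_start, file_start))
    fileobj.write(dir_bytes)
    fileobj.write(file_bytes)
    return dir_start, file_start


dir_bytes, dir_offsets = build_section({b"a/": b"D1", b"b/": b"D22"})
file_bytes, file_offsets = build_section({b"a/x": b"F1"})
buf = io.BytesIO()
dir_start, file_start = write_index(buf, dir_bytes, file_bytes)
```

Because the recorded offsets are relative, neither section needs to
know the size of the other while it is being built; a reader adds the
section start from the header to get the absolute position.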

> * Do you plan to write prototypes for some of the cool new
>   functionality that v5 is intended to make possible?  For example,
> 
>   * reading a few specific entries out of an index file
> 
>   * updating single entries
> 
>   * adding/removing conflict data to an existing file
> 
>   * dealing with all of the issues that will come with supporting the
>     mutation of an existing index file (i.e., locking, consistency
>     checks, etc)
> 
>   As you probably know from discussions on IRC, I think that the last
>   of these is the biggest risk to the success of the project.

I was thinking of implementing prototypes as the project goes on, but not
all of them now. I'd rather start by implementing the reader, because
otherwise time could become a problem for the midterm.

--
Thomas
