From: Michael Haggerty <mhagger@alum.mit.edu>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Mon, 14 May 2012 23:08:58 +0200 [thread overview]
Message-ID: <4FB1746A.6090408@alum.mit.edu> (raw)
In-Reply-To: <20120514150113.GD2107@tgummerer>
On 05/14/2012 05:01 PM, Thomas Gummerer wrote:
> Thanks a lot for your feedback. I've now refactored the code and
> thanks to your suggestions hopefully made it simpler and easier
> to read. The reader should now read exactly the data from the
> spec.
Yes, the style is much better now. Here is the next round of feedback:
1. f is still a global variable. This is unnecessary.
2. read_name() is indented incorrectly.
3. The signature and version number should be checked within
read_header() *before* reading anything else. Otherwise, if some random
file is given to the script, it will read some random, probably very
large number for "nextensions" and then try to read that number of
extension offsets into RAM. It would be a pretty innocent error to have
the wrong version of the index file lying around, and the script should
detect such an error quickly and painlessly rather than triggering a lot
of pointless disk activity (and likely an out-of-memory error) before
figuring out that it shouldn't be reading the file in the first place.
4. For readability reasons, it is better to raise an exception
immediately in an error situation rather than put the error-handling
code in the "else" part of an if statement; i.e., instead of
if not error_condition:
non-error-actions
else:
raise exception
use
if error_condition:
raise exception
non-error-actions
The reasons are: (a) the reader can see immediately what will happen in
the error case, then forget it and continue on with the "normal" flow
instead of having to remember the "if" condition through the whole long
"then" block before finally seeing the point in the "else" block; (b)
then the "normal" case doesn't have to be within the if statement at
all, thereby requiring less block nesting and less indentation. For
example:
> - if crc == datacrc:
> - return dict(signature=signature, vnr=vnr, ndir=ndir, nfile=nfile,
> - nextensions=nextensions, extoffsets=extoffsets)
> - else:
> - raise Exception("Wrong header crc")
> + if crc != datacrc:
> + raise Exception("Wrong header crc")
> +
> + return dict(signature=signature, vnr=vnr, ndir=ndir, nfile=nfile,
> + nextensions=nextensions, extoffsets=extoffsets)
5. If the first limit of a range() or xrange() is zero, it is usually
omitted; e.g., "xrange(0, nextensions)" -> "xrange(nextensions)".
6. It is possible to precompile "struct" patterns, which should be
faster and also allows you to ask the Struct object its size, reducing
the number of magic numbers needed in the code. And fixed-length
strings can also be read via struct. For example:
> DIR_DATA_STRUCT = struct.Struct("!HIIIIII 20s")
>
>
> def read_dirs(f, ndir):
> dirs = list()
> for i in xrange(0, ndir):
> (pathname, partialcrc) = read_name(f)
>
> (filedata, partialcrc) = read_calc_crc(f, DIR_DATA_STRUCT.size, partialcrc)
> (flags, foffset, cr, ncr, nsubtrees, nfiles,
> nentries, objname) = DIR_DATA_STRUCT.unpack(filedata)
> # ...
7. The "if dirnr == 0" stuff in read_files() should probably be done in
read_index_entries() (the first invocation is the only place dirnr==0,
right?)
8. read_files() only returns a result "if len(directories) > dirnr".
But actually I don't see the purpose for the check. Earlier in the
function you dereference directories[dirnr] several times, so it *must*
be that dirnr < len(directories). I think you can take out this test
and unindent its block.
9. read_files() doesn't need to return "entries". Since entries is an
array that is only mutated in place, the return value will always be the
same as the "entries" argument (albeit fuller).
10. It would make sense to extract a function read_file(), which reads a
single file entry from the current position in the current file (using
code from read_files()). Similarly for read_dir()/read_dirs().
11. It is good form to move the file-level code into a main() function,
then call that from the bottom of the file, something like this:
> def main(args):
> ....
>
> main(sys.argv[1:])
This avoids creating global variables that are accidentally used within
functions.
What is your plan for testing this code, and later the C version? For
example, you might want to have a suite of index files with various
contents, and compare the "git ls-files --debug" output with the output
that is expected. How would you create index files like this? Via git
commands? Or should one of your Python scripts be taught how to do it?
To make testing easier, you probably don't want to hard-code the name of
the input file in git-read-index-v5.py, so that you can use it to read
arbitrary files. For example, you might want to honor the
GIT_INDEX_FILES environment variable in some form, or to take the name
of the index file as a command-line argument.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
next prev parent reply other threads:[~2012-05-14 21:16 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03 ` Junio C Hamano
2012-05-04 7:12 ` Michael Haggerty
2012-05-07 22:18 ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36 ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
2012-05-03 19:31 ` Thomas Rast
2012-05-03 19:32 ` Thomas Rast
2012-05-03 20:32 ` Junio C Hamano
2012-05-03 21:38 ` Thomas Gummerer
2012-05-07 18:57 ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44 ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46 ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44 ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08 ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11 ` Thomas Gummerer
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
2012-05-10 6:53 ` Thomas Gummerer
2012-05-10 11:06 ` Nguyen Thai Ngoc Duy
2012-05-09 8:37 ` Michael Haggerty
2012-05-10 12:19 ` Thomas Gummerer
2012-05-10 18:17 ` Michael Haggerty
2012-05-11 17:12 ` Thomas Gummerer
2012-05-13 19:50 ` Michael Haggerty
2012-05-14 15:01 ` Thomas Gummerer
2012-05-14 21:08 ` Michael Haggerty [this message]
2012-05-14 22:10 ` Thomas Rast
2012-05-15 6:43 ` Michael Haggerty
2012-05-15 13:49 ` Thomas Gummerer
2012-05-15 15:02 ` Michael Haggerty
2012-05-18 15:38 ` Thomas Gummerer
2012-05-19 13:00 ` Michael Haggerty
2012-05-21 7:45 ` Thomas Gummerer
2012-05-16 5:01 ` Michael Haggerty
2012-05-16 21:54 ` Thomas Gummerer
2012-05-19 5:40 ` Michael Haggerty
2012-05-21 20:30 ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54 ` Thomas Gummerer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FB1746A.6090408@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=davidbarr@google.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.net \
--cc=spearce@spearce.org \
--cc=t.gummerer@gmail.com \
--cc=trast@student.ethz.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).