From: Michael Haggerty <mhagger@alum.mit.edu>
To: Thomas Rast <trast@student.ethz.ch>
Cc: Thomas Gummerer <t.gummerer@gmail.com>,
git@vger.kernel.org, gitster@pobox.com, peff@peff.net,
spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Fri, 04 May 2012 09:12:46 +0200
Message-ID: <4FA3816E.8090005@alum.mit.edu>
In-Reply-To: <87obq5p1t0.fsf@thomas.inf.ethz.ch>

On 05/03/2012 08:16 PM, Thomas Rast wrote:
> Thomas Gummerer<t.gummerer@gmail.com> writes:
>
>> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
> [...]
>> [7] Since all stat data (except mtime and ctime) is just used for
>> checking if a file has changed a checksum of the data is enough.
>> In addition to that Thomas Rast suggested ctime could be ditched
>> completely (core.trustctime=false) and thus included in the
>> checksum. This would save 24 bytes per index entry, which would
>> be about 4 MB on the Webkit index.
>> (Thanks for the suggestion to Michael Haggerty)
>
> This is the part I'm most curious about. Are we missing anything?
> Michael brought it up on IRC: the stat() results are only used to test
> whether they are still the same, with the exception of the mtime (which
> also undergoes raciness checks).
>
> As far as I can see, none of st_{ino,dev,uid,gid} are useful for
> anything. st_size might conceivably be used as a hint for a buffer
> size, but nobody actually does that. The ctime undergoes stricter
> checks, but AFAICS it's also all about whether it has changed, and
> besides that can be turned off. We think all of those fields can be
> replaced by an arbitrary hash/CRC and only tested for equality. 32 bits
> should be plenty, probably even if we just xor the values together.

XOR is definitely *not* adequate; for example, changing uid=gid="you" to
uid=gid="me" would not affect the XOR of the values (assuming, as is
often the case, that each user has his own uid/gid with the same
numerical values).
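
To make the uid/gid case concrete, here is a minimal sketch in Python (not git code). The helper names, the 32-bit big-endian packing, and the sample stat values are all assumptions made up for illustration; only the field list comes from the proposal quoted above.

```python
import struct
import zlib

# Field order follows the proposal: ctime seconds, ctime nanoseconds,
# ino, file size, dev, uid, gid.  Truncating every field to 32 bits and
# packing big-endian is an assumption of this sketch, not the v5 format.
FIELDS = ("ctime_sec", "ctime_nsec", "ino", "size", "dev", "uid", "gid")


def xor_digest(stat):
    """XOR all fields together, keeping the low 32 bits."""
    digest = 0
    for name in FIELDS:
        digest ^= stat[name] & 0xFFFFFFFF
    return digest


def crc_digest(stat):
    """CRC-32 over the fields packed as seven 32-bit big-endian words."""
    packed = struct.pack(">7I", *(stat[name] & 0xFFFFFFFF for name in FIELDS))
    return zlib.crc32(packed) & 0xFFFFFFFF


# Hypothetical entry owned by a user whose uid and gid are both 1000.
before = {"ctime_sec": 1336115566, "ctime_nsec": 0, "ino": 123456,
          "size": 4096, "dev": 2049, "uid": 1000, "gid": 1000}
# Same entry after a chown to a user whose uid and gid are both 1001.
after = dict(before, uid=1001, gid=1001)

xor_collides = xor_digest(before) == xor_digest(after)
crc_collides = crc_digest(before) == crc_digest(after)
print("XOR collides:", xor_collides)
print("CRC-32 collides:", crc_collides)
```

uid ^ gid is 0 both before and after, so the XOR digest cannot see the change of owner, while the CRC-32 of the packed fields differs.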

Which hash to use depends on some estimate of the likelihood that the
hashes collide and simultaneously that the other metadata coincide. It
seems to me that CRC-32 would be adequate. But if not, a longer hash
could be used (albeit with less space savings).
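
As a rough way to put numbers on that estimate, the following Python sketch computes the chance that at least one changed entry goes unnoticed. It assumes the checksum behaves like a uniformly random function of the stat data, which is only an approximation for CRC-32, and it ignores the mtime check, which would have to coincide as well. The function name and the entry count are hypothetical.

```python
import math


def miss_probability(changed_entries, bits=32):
    """Chance that at least one of `changed_entries` changed stat tuples
    hashes to its old checksum, for an idealized `bits`-bit hash."""
    p_single = 2.0 ** -bits
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny values keep precision.
    return -math.expm1(changed_entries * math.log1p(-p_single))


# Hypothetical workload: 100000 entries whose hashed stat data changed.
for bits in (32, 64):
    print(bits, "bits:", miss_probability(100000, bits))
```

For a single changed entry and a 32-bit hash this is 2^-32, about 2.3e-10. Doubling the hash to 64 bits lowers it by many orders of magnitude, at the cost of 4 more bytes per index entry.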

Michael

--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/