From: Robin Rosenberg <robin.rosenberg@dewire.com>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: Thomas Rast <trast@student.ethz.ch>,
Thomas Gummerer <t.gummerer@gmail.com>,
git@vger.kernel.org, gitster@pobox.com, peff@peff.net,
spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Tue, 08 May 2012 00:18:26 +0200
Message-ID: <4FA84A32.7070607@dewire.com>
In-Reply-To: <4FA3816E.8090005@alum.mit.edu>
Michael Haggerty wrote 2012-05-04 09.12:
> On 05/03/2012 08:16 PM, Thomas Rast wrote:
>> Thomas Gummerer<t.gummerer@gmail.com> writes:
>>
>>> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>>> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
>> [...]
>>> [7] Since all stat data (except mtime and ctime) is just used for
>>> checking if a file has changed a checksum of the data is enough.
>>> In addition to that Thomas Rast suggested ctime could be ditched
>>> completely (core.trustctime=false) and thus included in the
>>> checksum. This would save 24 bytes per index entry, which would
>>> be about 4 MB on the Webkit index.
>>> (Thanks for the suggestion to Michael Haggerty)
>>
>> This is the part I'm most curious about. Are we missing anything?
>> Michael brought it up on IRC: the stat() results are only used to test
>> whether they are still the same, with the exception of the mtime (which
>> also undergoes raciness checks).
>>
>> As far as I can see, none of st_{ino,dev,uid,gid} are useful for
>> anything. st_size might conceivably be used as a hint for a buffer
>> size, but nobody actually does that. The ctime undergoes stricter
>> checks, but AFAICS it's also all about whether it has changed, and
>> besides that can be turned off. We think all of those fields can be
>> replaced by an arbitrary hash/CRC and only tested for equality. 32 bits
>> should be plenty, probably even if we just xor the values together.
>
> XOR is definitely *not* adequate; for example, changing uid=gid="you" to uid=gid="me"
> would not affect the XOR of the values (assuming, as is often the case, that each user
> has his own uid/gid with the same numerical values).
If you change uid/gid, that has no relevance for the content that git tracks. If the CRC
is equal you have to check the content. Ideally a change that does not change the content
should not change the CRC either, so there is really no absolute need to see that change.
I assume the idea is that if you do "tar xvf" or something like that, then changes in file,
mtime etc could be picked up by looking at these attributes, but it seems that those that
mess with mtime such that it goes back in time are out of luck with git anyway.
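To illustrate the XOR cancellation Michael describes, here is a minimal sketch. It is not git code: the function names are made up for illustration, and it fingerprints only uid and gid, whereas the proposal would cover more stat fields. It just shows that XOR cannot see uid=gid changing together, while a CRC-32 over the same two values can.

```python
import struct
import zlib


def xor_fingerprint(uid, gid):
    """Hypothetical fingerprint: plain XOR of the two values."""
    return uid ^ gid


def crc_fingerprint(uid, gid):
    """Hypothetical fingerprint: CRC-32 over both values packed as
    big-endian unsigned 32-bit integers."""
    return zlib.crc32(struct.pack(">II", uid, gid)) & 0xFFFFFFFF


# Assumed example ids: each user has a private group with the same number.
you = (1000, 1000)
me = (1001, 1001)

# XOR of two equal values is always 0, so the ownership change is invisible.
print(xor_fingerprint(*you), xor_fingerprint(*me))

# The CRC-32 values differ, so the change would be noticed.
print(crc_fingerprint(*you) != crc_fingerprint(*me))
```

Whether noticing that change is desirable at all is the separate question raised above, since ownership is not content that git tracks.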
> Which hash to use depends on some estimate of the likelihood that the hashes collide and
> simultaneously that the other metadata coincide. It seems to me that CRC-32 would
> be adequate. But if not, a longer hash could be used (albeit with less space savings).
>
> Michael
>
JGit simply ignores ctime, ino, dev, uid and gid. The real reason is of course that
standard Java does not have an API for these extra attributes. On the other hand,
nobody is going to fix this bug. The reason is that if you follow the rule that mtime
must always change to "now" when content changes, then all changes will be found simply
by looking at mtime or performing a content check for the racy case. Those who mess
with mtime tend to be unhappy anyway.
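As a rough sketch of that rule (not JGit or git code; the function and its parameter names are hypothetical, and timestamps are simplified to plain numbers), the decision could look like this. The racy case is assumed to be the one where the recorded mtime is not older than the time the index itself was written, so a later edit within the same timestamp would go unnoticed.

```python
def classify(entry_mtime, file_mtime, index_mtime):
    """Decide what to do with one index entry using mtime alone.

    entry_mtime: mtime recorded in the index for the file
    file_mtime:  mtime currently reported by the filesystem
    index_mtime: time the index file itself was last written
    """
    if file_mtime != entry_mtime:
        # mtime moved, so treat the file as modified.
        return "modified"
    if entry_mtime >= index_mtime:
        # Racy case: the file may have been changed again within the
        # same timestamp as the index write, so only the content can tell.
        return "check-content"
    # mtime is unchanged and safely older than the index.
    return "clean"


print(classify(entry_mtime=100, file_mtime=100, index_mtime=200))
print(classify(entry_mtime=100, file_mtime=150, index_mtime=200))
print(classify(entry_mtime=200, file_mtime=200, index_mtime=200))
```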
Then there is the issue of how often we can detect changes without checking content. Ino
usually changes, but when it changes mtime usually does too, so how often does it actually
speed things up?
Has anyone instrumented git to see how much the different attributes actually
contribute to performance and accuracy?
I'd like to extend the size field to 64 bits. We rarely need the extra bits, but we
cannot distinguish between 3 bytes and 4294967299 bytes, so avoiding the very expensive
content check there would be welcome, even if it's a rare event. I haven't thought
too much about this though. I just felt uncomfortable when looking at the code and
knowing that performing a content check of a 4 GB file could take a minute or two.
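The arithmetic behind the 3 versus 4294967299 example, as a small sketch. The helper names are made up for illustration, and the only assumption is that the stored size is the real size truncated to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def stored_size(size):
    """Size as it would be kept in a 32-bit field (assumed: truncation)."""
    return size & MASK32


def size_looks_unchanged(indexed_size, actual_size):
    """Compare a stored 32-bit size against the real size on disk."""
    return indexed_size == stored_size(actual_size)


small = 3
large = 2**32 + 3  # 4294967299 bytes

# Both sizes collapse to the same 32-bit value, so the size field alone
# cannot rule out a change and the expensive content check is needed.
print(stored_size(small), stored_size(large))
print(size_looks_unchanged(stored_size(small), large))
```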
-- robin