Git development
 help / color / mirror / Atom feed
* Re: [PATCH] Implement fast hash-collision detection
From: Jeff King @ 2011-11-30  6:25 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds
In-Reply-To: <1322603788.1728.190.camel@yos>

On Tue, Nov 29, 2011 at 01:56:28PM -0800, Bill Zaumen wrote:

> The additional CRC (easily changed to whatever message digest one might
> prefer) makes a malicious attack far more difficult: the modified file
> has to have both the same SHA-1 hash (including the Git header) and 
> the same CRC (not including the Git header).

Only if the attack actually involves creating a collision on both. But I
think the important attacks bypass your CRC anyway. Consider this attack
scenario:

  1. Linus signs a tag (or a commit) and pushes it to kernel.org.

  2. kernel.org gets hacked, and the attacker replaces an object with
     an evil colliding version[1].

  3. I clone from kernel.org, and run "git tag --verify". Git says it's
     OK, because the signature checks out, but I have a bogus object.

How does your CRC help? If I understand your scheme correctly,
kernel.org will have told me the CRC of all of the objects during the
clone. But that isn't part of what Linus signed, so the attacker in step
2 could just as easily have overwritten kernel.org's crc file, and the
signature will remain valid.

[1] This is an over-simplification, of course. Because the only even
    remotely feasible attacks on sha1 are birthday attacks, not pre-image
    attacks, there is a step 0 in which the attacker generates a
    colliding pair, convinces Linus to commit it, and then waits.

    Which is probably really hard, but for the purposes of this
    discussion, we assume the attacker is capable of inserting a
    colliding object maliciously into a repo you will fetch from.
    Otherwise, the integrity of sha1 isn't an issue at all.

> An efficient algorithm to do both simultaneously does not yet exist.
> So, if we could generate a SHA-1 collision in one second, it would
> presumably take billions of seconds (many decades of continuous
> computation) to generate a SHA-1 hash with the same CRC, and well
> before a year has elapsed, the original object should have been in all
> the repositories, preventing a forged object from being inserted. Of
> course, eventually you might need a real message digest.

This is wrong, for two reasons.

  1. The method for generating an object that collides in both sha-1 and
     CRC is not necessarily to generate a colliding sha-1 and then do a
     pre-image attack on the CRC. It is to do a birthday attack on the
     sha-1 and the CRC together. Which halves the bit-strength of the
     CRC to 16 bits (just as we can generally find collisions in 160-bit
     sha1s in 2^80). 16 bits isn't a lot to add when you are trying to
     fix a broken cryptosystem (it's not broken yet, obviously, but when
     it does get broken, will it be because computing reaches the 2^57
     or so that sha1 is broken at, or will it be because a new weakness
     is found that drops sha1's bit-strength to something much lower?).

     This assumes that you can combine the two in a birthday attack.
     Certainly this analysis works against brute-force 2^80 sha1
     collision attacks. But I haven't actually read the details of the
     sha1 attacks, so maybe some of the tweaking they do to get those
     results makes it harder. On the other hand, attacking CRC is far
     from hard, so I certainly wouldn't stake money that sha1 reseachers
     couldn't tweak their attacks in a way that also allows finding CRC
     collisions. You say that an algorithm to do both simultaneously
     does not yet exist. But is that because it's hard, or simply
     because nobody has bothered trying?

     Anyway, all of that is just reiterating that CRC should not be used
     as a security function. It can easily be replaced in your scheme by
     sha-256, which does have the desired properties.

  2. Your attack seems to be "find the sha-1 collision, publish one of
     your colliding objects (i.e., the innocent-looking half), then try
     to break the CRC". And then you claim that by the time you find the
     CRC, everybody will already have the object.

     But wouldn't a smarter attack be to first find the collision, including
     the CRC, and only _then_ start the attack? Then nobody will have
     the object.

     Moreover, it's not true that after a year everyone will have the
     object. People still run "git clone" against kernel.org. Those
     repos do not have the object.

> The weakness of a CRC as an integrity check is not an issue since it
> is never used alone: it's use is more analogous to the few extra bits
> added to a data stream when error-detecting codes are used.  I used a
> CRC in the initial implementation rather than a message digest because
> it is faster, and because the initial goal was to get things to work
> correctly.  In any case, the patch does not eliminate any code in
> which Git already does a byte-by-byte comparison.  In cases where Git
> currently assumes that two objects are the same because the SHA-1
> hashes are the same, the patch compares CRCs as an additional test.

Right. I don't claim that your scheme makes git any weaker. I just claim
that it fails to solve the problems people are actually concerned about,
and it adds a lot of complexity while doing so.

> Regarding your [Jeff's] second concern, "how does this alternative
> digest have any authority?" there are two things to keep in mind.
> First, it is a supplement to the existing digest.

Right, but we are assuming that sha1 is broken. That's the whole
security problem. So the existing digest is not worth much.

> Second, any value of the CRC that is stored permanently (baring bugs,
> in my implementation, of course) is computed locally - when a loose
> object is created or when a pack file's index is created.  At no point
> is a CRC that was obtained from another repository trusted. While the
> patch modifies Git so that it can send CRCs when using the git
> protocol, these CRCs are never stored, but are instead used only for
> cross checks.  If one side or the other "lies", you get an error.

But if I don't already have the object, then I have nothing to compare
against. So when I get it from kernel.org, I have to simply accept that
the object I'm getting is good, and write it into my object db.

> BTW, regarding your [Jeff's] discussion about putting an additional
> header in commit messages - I tried that.  The existing versions of
> Git didn't like it: barring a bug in my test code, it seems that Git
> expects headers in commit messages to be in a particular order and
> treats deviations from that to be an error.

Yes, the header has to go at the end of the existing headers. But I
don't see any reason that would be a problem for the scheme I described.

> I even tried appending blank lines at the end of a commit, with spaces
> and tabs encoding an additional CRC, and that didn't work either - at
> least it never got through all the test programs, failing in places
> like the tests involving notes.

Yes, git will helpfully trim whitespace in commit messages. With the
current code, you can hide arbitrary bytes in a commit message after a
NUL, but don't do that. It's not guaranteed to stay that way, and the
appropriate place to add new information is in a header.

> In any case, you'd have to phase in such a change gradually, first
> putting in the code to read the new header if it is there, and
> subsequently (after ample time so that everyone is running a
> sufficiently new version) enabling the code to create the new header.

Current git should ignore headers that it doesn't understand. I haven't
tested this, but Junio recently has been experimenting with
gpg-signature lines in commits, and I'm pretty sure he checked that
older gits properly ignore them.

-Peff

^ permalink raw reply

* [PATCH 2/3] Add documentation for fast hash collision detection
From: Bill Zaumen @ 2011-11-30  6:12 UTC (permalink / raw)
  To: git; +Cc: gitster

The documentation added is technical documentation describing
how fast hash collision detection operates and changes to
several git commands (a few new command-line arguments, mostly).

Note: the change to the implementation is in the child of
this commit.

Signed-off-by: Bill Zaumen <bill.zaumen+git@gmail.com>
---
 Documentation/git-count-objects.txt          |   12 +-
 Documentation/git-index-pack.txt             |   16 +-
 Documentation/git-verify-pack.txt            |   23 ++
 Documentation/technical/collision-detect.txt |  385 ++++++++++++++++++++++++++
 Documentation/technical/pack-format.txt      |   40 +++
 5 files changed, 471 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/technical/collision-detect.txt

diff --git a/Documentation/git-count-objects.txt b/Documentation/git-count-objects.txt
index 23c80ce..4cdbaf5 100644
--- a/Documentation/git-count-objects.txt
+++ b/Documentation/git-count-objects.txt
@@ -8,7 +8,7 @@ git-count-objects - Count unpacked number of objects and their disk consumption
 SYNOPSIS
 --------
 [verse]
-'git count-objects' [-v]
+'git count-objects' [-v] [-M]
 
 DESCRIPTION
 -----------
@@ -25,6 +25,16 @@ OPTIONS
 	objects, number of packs, disk space consumed by those packs,
 	and number of objects that can be removed by running
 	`git prune-packed`.
+-M::
+--count-md::
+	Report the number of loose objects with no stored message digests.
+	With the -v option, the number of missing "mds" files (these
+	contain the message digests for the SHA1 hashes in the corresponding
+	"idx" files) is reported, along with a count of the number of
+	mds files whose size is wrong (e.g., an index was created but the
+	existing MDS file was not updated) and a count of the number of
+	objects in pack files that do not have a stored message digest.
+	Values that are zero are not shown.
 
 GIT
 ---
diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 909687f..3285fae 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -11,14 +11,14 @@ SYNOPSIS
 [verse]
 'git index-pack' [-v] [-o <index-file>] <pack-file>
 'git index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>]
-                 [<pack-file>]
+		 [-m <mds-file>] [<pack-file>]
 

 DESCRIPTION
 -----------
-Reads a packed archive (.pack) from the specified file, and
-builds a pack index file (.idx) for it.  The packed archive
-together with the pack index can then be placed in the
+Reads a packed archive (.pack) from the specified file, and builds a
+pack index file (.idx) and a pack mds file (.mds) for it.  The packed
+archive together with the pack index can then be placed in the
 objects/pack/ directory of a git repository.
 

@@ -35,6 +35,14 @@ OPTIONS
 	fails if the name of packed archive does not end
 	with .pack).
 
+-m <mds-file>::
+	Write the generated pack mds file into the specified.
+	file Without this option, the name of the pack mds
+	file is constructed from the name of packed archive
+	file by replacing .pack with .idx (and the program
+	fails if the name of packed archive does not end
+	with .pack).
+
 --stdin::
 	When this flag is provided, the pack is read from stdin
 	instead and a copy is then written to <pack-file>. If
diff --git a/Documentation/git-verify-pack.txt b/Documentation/git-verify-pack.txt
index cd23076..e81c514 100644
--- a/Documentation/git-verify-pack.txt
+++ b/Documentation/git-verify-pack.txt
@@ -33,6 +33,15 @@ OPTIONS
 	Do not verify the pack contents; only show the histogram of delta
 	chain length.  With `--verbose`, list of objects is also shown.
 
+-M::
+--show-mds::
+	Show the message digests along with the 40-character object names
+	(SHA1 value in hexidecimal). Ignored if --stat-only is set. If
+	--verbose is not set, only the table indexed by object names is
+	shown, although the files will be verified.  The message digests
+	printed are the actual ones - if the MDS file does not contain these,
+	the verification will fail.
+
 \--::
 	Do not interpret any more arguments as options.
 
@@ -48,6 +57,20 @@ for objects that are not deltified in the pack, and
 
 for objects that are deltified.
 
+When the -M option is used, the offset-in-pack field is followed by an
+entry giving the message digest.  The format used is:
+
+      md=0xHEX_VALUE
+
+when a message digest exists, and
+
+     <no md>
+
+when a message digest does not exist.  These entries precede the depth
+entry for deltified objects.  A non-existent message digest will be shown
+only if the MDS file is missing - while the MDS-file format allows missing
+entries, the file will not be considered valid.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/Documentation/technical/collision-detect.txt b/Documentation/technical/collision-detect.txt
new file mode 100644
index 0000000..0d33da8
--- /dev/null
+++ b/Documentation/technical/collision-detect.txt
@@ -0,0 +1,385 @@
+Fast Hash-Collision Detection
+=============================
+
+Initially Git used a SHA-1 hash as an object ID under the assumption
+that a hash collision would never occur in practice. While an
+accidental SHA-1 collision is extremely unlikely, it is possible,
+although very expensive, to generate multiple files with the same
+SHA-1 value in under 2^57 operations.  With computer performance
+increasing significantly from one year to the next, Git's assumptions
+about SHA-1 will eventually not hold in the case of a malicious
+attempt to damage a project.  One should note that just because the
+probability of a SHA-1 collision occurring accidentally is extremely
+low does not mean a priori that SHA-1 provides an adequate safety
+margin for preventing a malicious attempt to damage repositories and a
+discussion below outlines some of the issues regarding this
+possibility.
+
+While one could modify Git to use SHA-224, SHA-256, SHA-384, or
+SHA-512 instead of SHA-1, the change would have to support the
+original format as well (in order to deal with existing Git
+repositories). While one could convert an existing repository to use
+the new hash function, this would require rewriting every object,
+including trees and commits.  The outcome would be problematic given
+the existence of email and documentation that might name commits by
+their SHA-1 hashes. One should note that Git performs a byte-by-byte
+check for hash collisions when a pack file is indexed.  Unfortunately,
+during fetch or pull operations, Git tries to avoid copying objects
+when a peer already has a copy, and this is determined solely on the
+basis of SHA-1 hashes.
+
+The following describes a modification to Git's initial design that is
+(a) relatively easy to implement, (b) is compatible with and can
+interoperate with older versions of Git (both the program and the
+repositories) (c) has a small computational overhead, and (d)
+increases security substantially, with a goal of detecting hash
+collisions early and automatically.  Because the implementation is
+relatively simple and the overhead very low, it makes sense to
+incorporate this change (or some alternative) before the security
+issue becomes a serious problem.
+
+Although Git generally uses that assumption that there will never be a
+hash collision using SHA-1 in practice, under some circumstances, Git
+will detect collisions via a byte-by-byte comparison as objects are
+added to the repository or as pack files are indexed.  This test is
+performed when an index is built (via the Git pack-index command), but
+a byte-by-byte comparison was deemed too computationally expensive to
+use in all circumstances: with pack files in particular, simply
+extracting an object can require not only decompressing it, but
+handling a series of delta encodings.
+
+Collision detection has been extended by computing a message digest or
+CRC of the object's contents (i.e., excluding the Git header). These
+message digests are stored separately from Git objects and are used
+for an independent collision test - looking up the message digests or
+CRCs using the SHA-1 IDs as a key can be done quickly, and comparing
+them is fast as well (a single unsigned-integer comparison for a
+32-bit CRC).  Assuming statistical independence in the CRC case, the
+changes of an undetected SHA-1 collision, should one occur, is 1 in
+2^32.  This extension is computationally cheap (timing the Git test
+suite (run via 'make test') showed only a small increase in running
+time and the extension is backwards compatible with existing Git
+repositories - if a CRC is not available for a SHA-1 value, the
+implementation reverts to its former behavior and simply compares
+SHA-1 values.  The CRC can, of course, be easily replaced with a
+SHA-256 or SHA-512 digest to reduce the chances of an undetected SHA-1
+collision to nearly zero.  For that reason, in the following we will
+use the terms MD (Message Digest) and CRC interchangeably.
+
+The implementation creates a directory in .git/objects named "crcs",
+which contains sub-directories and file names identical to the
+sub-directories in objects used to store loose objects: a two
+character directory name, with a 38-character file name, the
+concatenation of which gives the SHA-1 hash for the object.  the files
+in sub-directories of "crcs", however, simply contain 32-bit CRCs
+stored in network byte order.  In addition, for each pack file
+(.../objects/pack/FILE.pack), there is a corresponding file named
+.../objects/pack/FILE.mds in addition to .../objects/pack/FILE.idx.
+The MDS file contains the CRCs, stored in the same order as the SHA-1
+hashes in .../objects/pack/FILE.idx.  The format of the MDS file is
+described in pack-format.txt.
+
+Thus, the directory structure (only part of it is shown) is as
+follows:
+
+ .git---.
+	|
+	|-objects-.
+	|	  |--XX--.
+	.	  |	 |--XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+	.	  |	 .
+	.	  |	 .
+		  |	 .
+		  .
+		  .
+		  .
+		  |-crcs-.
+		  |	 |--XX--.
+		  |	 |	|--XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+		  |	 .	.
+		  |	 .	.
+		  |	 .	.
+		  |
+		  |-pack-.
+		  |	 |--YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY.pack
+		  |	 |--YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY.idx
+		  |	 |--YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY.mds
+		  |	 .
+		  |	 .
+		  |	 .
+		  |
+		  `-info-.
+			 .
+			 .
+
+The mds files are relatively short - an average of 5 bytes per CRC
+plus some fixed overhead due to a header and trailer, with the CRCs
+listed in the same order as the SHA-1 values in the matching idx file
+(a function named nth_packed_object_objcrc32 has the same signature as
+the previously-defined function nth_packed_object_offset, so the
+procedure to look up the MD/CRC value from a pack file is the same).
+
+For fetch and push operations, the commands fetch-pack, send-pack,
+receive-pack, and upload-pack were modified so that various object IDs
+can have any one of the following formats, with each number
+represented in hexadecimal:
+
+		SHA1
+		SHA1-MD
+		SHA1-MD-BlobMD
+
+where SHA1 is the SHA-1 hash of a commit, MD is the 32-bit CRC of the
+commit (uncompressed, not including the Git object header), and BlobMD
+is a CRC of the CRCs for each blob found by traversing the commit's
+tree using a depth-first search, with blobs processed in the order
+they appear in the trees (this is a relatively fast operation because
+the CRCs of each blob are stored in the repository).
+
+Both receive-pack and upload-pack send a capability named "mds-check"
+to allow the two longer object IDs.  When the CRCs are available, the
+longer formats are used, but are generated only by fetch-pack and
+send-pack: because of backwards-compatibility constraints,
+receive-pack and upload-pack cannot determine the capabilities of
+fetch-pack and send-pack when connected to a remote repository).
+While the message digests associated with each object are computed
+once and stored, the "BlobMD" ones are computed each time - the
+"BlobMD" ones are only used by the Git commands fetch-pack,
+upload-pack, send-pack, and receive-pack.  In each case, a
+hash-collision check is performed only if the message digest used is
+available.  The collision checks during a fetch, push, or pull command
+are done by receive-pack and upload pack because send-pack and
+fetch-pack do not receive their peers' MD and BlobMD values.
+
+
+Implementation Details
+----------------------
+
+Functions to manipulate the message digest/CRC database are declared
+in the file crcdb.h.  The implementation as described above is in the
+file objd-crcdb.c: it is thus easy to change the implementation of how
+these objects are stored with minimal impact on the rest of the source
+code.
+
+In pack-write.c, there is a new function named write_mds_file with the
+same function signature as write_idx_file.  Both are called in pairs
+(write_idx_file first) so that the idx file and mds file for the
+corresponding pack file will always be created.
+
+In commit.c, there is a new function that recursively traverses the
+tree associated with a commit and finds the "blob" entries and looks
+up those entries' message digests in order to compute a message digest
+of these message digests (which is faster than computing a message
+digest of all the bytes in the blobs associated with a commit).
+
+Various function names signatures in sha1_file.c were changed to take
+two additional arguments, the first a pointer to an int used as a flag
+to indicate whether a MD/CRC exists, and the second a pointer to a
+uint32_t containing the MD/CRC (a CRC currently).  For backwards
+compatibility with previously existing functions, those functions had
+there names changed by adding "_extended" to them, with macros in
+cache.h defined so that existing code that does not need to obtain a
+MD/CRC would not be changed. There are a few additional functions
+added to sha1_file.c such as one to determine if there is an MD/CRC
+for a given SHA-1 value. Many changes in the rest of Git that result
+from this simply change the arguments to these functions.  As a
+convention, most such arguments use names like objcrc32, objcrc32p,
+has_objcrc32 and has_objcrc32p in order to make it easy to find areas
+of the code implementing hash-collision detection using the git-grep
+command.
+
+A few data structures (notably struct pack_idx_entry and struct
+packed_git) contain fields used to store has_objcrc32 and objcrc32
+values or data associated with MDS files.  These are used while
+building new MDS files.
+
+Some of the Git commands (count-objects, index-pack, and verify-pack)
+have additional command-line options related to the MD/CRCs and mds
+files. This makes it possible to explicitly name an mds file being
+created and to request that various listings show both the MD/CRC
+values in addition to SHA-1 hashes (the MD/CRC values are not listed
+by default in case user-defined scripts assume the current behavior).
+
+For C files, changes were made to the following files (compared to
+commit 017d1e13) for the initial collision-detection implementation:
+
+       * builtin/count-objects.c
+       * builtin/fetch-pack.c
+       * builtin/index-pack.c
+       * builtin/init-db.c
+       * builtin/pack-objects.c
+       * builtin/pack-redundant.c
+       * builtin/prune-packed.c
+       * builtin/prune.c
+       * builtin/receive-pack.c
+       * builtin/send-pack.c
+       * builtin/verify-pack.c
+       * commit.c
+       * environment.c
+       * fast-import.c
+       * gdbm-packdb.c (new file)
+       * git.c
+       * hex.c
+       * http.c
+       * objd-crcdb.c (new file)
+       * pack-write.c
+       * sha1_file.c
+       * upload-pack.c
+
+The other files had changes that reflected changes to function
+signatures.
+
+The header files that were modified are
+
+    * cache.h
+    * commit.h
+    * crcdb.h (new file)
+    * pack.h
+    * packdb.h (new file)
+
+where the changes are mostly new function declarations, a few macros
+for backwards-compatibility, and a few additional fields in some
+data structures.
+
+Minor changes were made to the test suite: t0000-basic.sh,
+t5300-pack-object.sh, t5304-prune.sh, t5500-fetch-pack.sh, and
+t5510-fetch.sh.
+
+The packdb functions are conditionally compiled and by default are not
+used.  When used, these use GDBM to store CRCs for SHA-1 hashes in
+cases in which the hash was not available - in this case the hash will
+be recomputed and stored for future use.  Testing indicates that
+packdb is not needed. It may be worth turning on during debugging to
+verify if a problem is discovered involving a missing MD/CRC. (As an
+aside, the packdb code is based on a test to see if GDBM would be
+efficient enough to store the MD/CRC values in general, thus avoiding
+the need to create "mds" files and reducing the number of files in the
+"crcs" directory, but it turned out that performance was not
+acceptable.)
+
+Security-Issue Details
+----------------------
+
+Without hash-collision detection, Git has a higher risk of data
+corruption due to the obvious hash-collision vulnerability, so the
+issue is really whether a usable vulnerability exists. Recent research
+has shown that SHA-1 collisions can be found in 2^63 operations or
+less.  While one result claimed 2^53 operations, the paper claiming
+that value was withdrawn from publication due to an error in the
+estimate. Another result claimed a complexity of between 2^51 and 2^57
+operations, and still another claimed a complexity of 2^57.5 SHA-1
+computations. A summary is available at
+<http://hackipedia.org/Checksums/SHA/html/SHA-1.htm#SHA-1>. Given the
+number of recent attacks, possibly by governments or large-scale
+criminal enterprises
+(<http://www.csmonitor.com/World/terrorism-security/2011/0906/Iranian-government-may-be-behind-hack-of-Dutch-security-firm>,
+<http://en.wikipedia.org/wiki/Operation_Aurora>,
+<http://en.wikipedia.org/wiki/Botnet#Historical_list_of_botnets>),
+which include botnets with an estimated 30 million computers, there is
+reason for some concern: while generating a SHA-1 collision for
+purposes of damaging a Git repository is extremely expensive
+computationally, it is possibly within reach of very well funded
+organizations. 2^32 operations, even if the operations are as
+expensive as computing a SHA-1 hash of a modest source-code file, can
+be performed in a reasonably short period of time on the type of
+hardware widely used in desktop or laptop computers at present. With
+sufficient parallelism, 30 million personal computers sufficient for
+playing the latest video games could perform 2^56 operations in a
+reasonable time.
+
+The security implications depend on how Git is used.  In the simplest
+case in which a single, shared repository is used by a number of
+developers, with source code only shared though this repository, the
+problems are minimal - Git will not allow one to insert an object
+whose SHA-1 hash matches that of an existing object.  Since an
+attacker would not know the SHA-1 hash until the correct object is
+already in the shared repository, all an attacker would succeed in
+doing is to create some confusion in his/her private repository and
+working copy (but note that some of the assumptions break down if
+developers email source files between themselves rather than transferring
+everything via a Git repository).
+developers, with source code only shared though this repository, the
+problems are minimal - Git will not allow one to insert an object
+whose SHA-1 hash matches that of an existing object.  Since an
+attacker would not know the SHA-1 hash until the correct object is
+already in the shared repository, all an attacker would succeed in
+doing is to create some confusion in his/her private repository and
+working copy (but note that some of the assumptions break down if
+developers email source files between themselves rather than transferring
+everything via a Git repository).
+
+In other cases, however, there could be an issue if a SHA-1 collision
+can be created quickly enough. As an example, suppose one is using the
+"Integration-manager workflow" described in "Pro Git" (see
+<http://progit.org/book/ch5-1.html>) with a "blessed repository" BR
+and two public developer repositories DPR1 and DPR2.  Suppose a
+developer puts a legitimate change to the code into DPR2. Another
+developer with read access to DPR2 and write access to DPR1
+immediately fetches this commit, and replaces one file with a modified
+version of the same size that has the same SHA-1 value, and then puts
+the commit into DPR1 (the change may be a trivial one designed to
+introduce an obscure buffer overflow error that can be exploited and
+may not stand out in a code review).  At this point DPR1 and DPR2 have
+identical commits (i.e., the same commit object) but with one file
+modified.  No local test on either DPR1 or DPR2 will uncover a
+discrepancy based on SHA-1 values, object contents, and digital
+signatures for commits.  Further assume that both developers send an
+email to the "integration manager" notifying him/her of the changes.
+
+Now suppose the above changes occurred late on a Friday afternoon and
+over a weekend.  On the next Monday morning, the "integration manager"
+reads the emails and pulls changes from DPR1 and then DPR2.  Once a
+'fetch' from DPR1 is complete, a subsequent fetch from DBR2 will not
+transfer any data related to the commit, as the integration manager's
+repository already has a commit with the matching SHA-1 value.  A
+modified copy will have been introduced into the system, and the
+developers using DPR2 may never notice the difference - as they will
+pull from DPR2 more often than BR, they will most likely have the
+correct files and the commit in their local repositories and the
+transfer protocol will avoid sending the commit if it is already
+available.  Furthermore, if the modified file introduces a hard-to-find
+buffer overflow testing may not uncover the problem.
+
+If the file in question is modified again, and a thin pack is used in
+a fetch so that the change is delta-encoded, the SHA-1 hashes may
+differ after de-deltafication, but such a change might not be
+introduced before a release.  While obtaining a different SHA-1
+hash than expected would be detected, the error would be a corrupted
+repository - a missing SHA-1 value in the tree associated with a commit.
+It would take some effort to figure out what went wrong - what
+happens depends on the state of both the client and server when a
+git-fetch operation runs.
+
+As a justification for the scenarios just described, one can use MD5
+as a model: http://www.mscs.dal.ca/~selinger/md5collision/ gives the
+following two 128-byte sequences (expressed in hexadecimal) as ones
+with the same MD5 hash:
+
+d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89
+55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b
+d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0
+e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70
+
+and
+
+d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89
+55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b
+d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0
+e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70
+
+These can be used for test purposes.  If you change the first 16 bytes
+of both sequences to the same new value, before the point where the
+files differ, you will get different MD5 hash values.  If you append
+the same text to both files, the MD5 values of the modified files are
+equal (and, of course, different from the previous value) but
+fortunately Git includes the length of an object's contents in the
+object's header, and the SHA-1 hash includes the header).  In most
+cases, whever a file is modified during software development, the
+file's length will change.  This causes a change early enough in the
+object so that applying the same patch to two different files that
+have the same SHA-1 value will typically result in two files with
+different SHA-1 values.  So, the result of a hash collision when
+multiple remote repositories are used would initially be different
+versions of the same file for the same commit in different
+repositories, possibly followed by some of the repositories being
+corrupted as one of these files is modified, depending on the state
+of each server and client when a fetch or pull operation is run.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 1803e64..4dfaf92 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -158,3 +158,43 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA1-checksum of all of the above.
+
+
+= pack-*.mds files contain message digests for objects (CRCs
+  initially, with other options possibly added at a later date such as
+  SHA-256).  The digests are stored in the same order as the sha-1 values
+  in the matching idx file.  These files have the following format:
+
+  - A 6-byte magic number consisting the the characters "PKMDS" followed
+    by a NULL character (0).
+
+  - A one-byte version number (= 1)
+
+  - A one-byte field-length value for message digest fields, in units of
+    4-byte words, with the legal value being 1. The length of the message
+    digest fields in bytes is denoted as wsize below).
+
+  - A set of blocks, each of which contains 4 entries encoded as follows:
+
+      * four one-byte fields, one per entry, for which a zero value
+	indicates that a matching entry does not exist and for which a
+	value of 1 indicates that the field contains a 32-bit CRC stored
+	in network byte order.
+
+      * 4 wsize-byte fields, one per entry, each containing a CRC
+	(by conventionwhich should be 0 if the CRC does not exist).
+	For each field, the data it contains should start at the first
+	byte, padded with NULL characters if the field is longer than
+	the digest it stores.
+
+    For the set of all blocks, the nth one-byte field and the nth 4-byte
+    field store the values for the nth entry in the file. The format
+    ensures that each message digest starts on a 32-bit boundary,
+    allowing 32-bit integer operations to be used in copying or
+    comparing values.
+
+  - A 20 byte SHA-1 hash of the SHA-1 hashes naming the objects whose
+    message digests are being stored, in the same order as they
+    appear in the corresponding idx file.
+
+  - A 20 byte SHA-1 hash of all of the above.
-- 
1.7.1

^ permalink raw reply related

* [PATCH 1/3] Add CRCDB and PACKDB modules for fast collision detection
From: Bill Zaumen @ 2011-11-30  5:59 UTC (permalink / raw)
  To: git; +Cc: gitster

The CRCDB module maintains a persistent mapping from SHA-1 hashes
to CRCs or message digests for Git objects. The current implementation
uses one file per CRC.  Documentation is in the header file crcdb.h
and there is a preprocessor directive CRCDB that should be set to 0
or 1, with the current choice being 0.

The PACKDB module (normally not turned on but can be conditionally
compiled) can be turned on for debugging/testing. This module
allows a CRC for an object to always be found, computing it from
scratch and storing it in a GDBM database.  It is intended for
use while building index files.  Testing seems to show that it is
not necessary as the needed information is always there.

Signed-off-by: Bill Zaumen <bill.zaumen+git@gmail.com>
---
 crcdb.h       |  191 +++++++++++++++++++++++++++++++++
 gdbm-packdb.c |  247 +++++++++++++++++++++++++++++++++++++++++++
 objd-crcdb.c  |  324 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 packdb.h      |  107 +++++++++++++++++++
 4 files changed, 869 insertions(+), 0 deletions(-)
 create mode 100644 crcdb.h
 create mode 100644 gdbm-packdb.c
 create mode 100644 objd-crcdb.c
 create mode 100644 packdb.h

diff --git a/crcdb.h b/crcdb.h
new file mode 100644
index 0000000..6eabb4f
--- /dev/null
+++ b/crcdb.h
@@ -0,0 +1,191 @@
+#ifndef CRCDB_H
+#define CRCDB_H
+
+/**
+ * CRC Database Support.
+ *
+ * This module uses GDBM to maintain a database mapping SHA-1 object keys
+ * to a 32-bit CRC for purposes of detecting hash collisions.  The CRCs
+ * are stored in the database in network byte order (i.e., as big-endian
+ * 32-bit unsigned integers).  The functions allow for initialization,
+ * queries, adding new entries (with a collision check), and managing
+ * access to alternate databases.
+ *
+ * The preprocessor symbol CRCDB determines the implementation of the
+ * module.
+ * Values:
+ *   0, 1 - implement using directories and files - the first byte of a
+ *       SHA1 hash determines a subdirectory of ../objects/crcs, and
+ *       the remaining bytes determine the file name, with the names
+ *       consisting of the hexadecimal representation of each byte's
+n *       value. The files then contain 32-bit CRCs stored in network
+ *       byte order.  A large number of 4-byte files is a poor use of
+ *       disk space, but may be useful for testing.  A value of 1 implies
+ *       that packdb will also be used.
+ */
+
+#include<stdint.h>
+
+#include "cache.h"
+
+#if (CRCDB == 0) || (CRCDB == 1)
+/**
+ * Opaque data type - because the typedef is for a pointer, we
+ * don't need the structure defined in files that use the pointer.
+ * We do need it defined somewhere, in this case in the file
+ * objd-crcdb.c, which is the only place the fields are used.
+ */
+typedef struct objd_crcdb *crcdb_t;
+#endif
+
+/**
+ *  Initialize the database.
+ *  This opens a database file in the objects directory named crcs,
+ *  used to store CRCS of objects (uncompressed, excluding the header)
+ *  for hash-collision detection.
+ */
+extern void crcdb_init(void);
+
+/**
+ * Check if the database has been initialized.
+ * Returns:
+ *   1 if crcdb_init has been called; false otherwise.
+ */
+extern int crcdb_initialized(void);
+
+/**
+ * Initializes alternative databases by adding them to a table with
+ * these databases closed.
+ */
+extern void crcdb_init_alts();
+
+
+/**
+ * Open a database file.
+ *
+ * The default database can be read or written. alternate database
+ * files are read-only databases.  Multiple calls without intervening
+ * calls to crcdb_close for a given argument will result in the same
+ * object being returned each successive time.  The pathname must match
+ * one stored by a call to crcdb_init_alts.
+ *
+ * Arguments:
+ *    pathname - the pathname of the file; NULL for the default db;
+ *
+ * Returns:
+ *    the database (NULL indicates the default)
+ */
+extern crcdb_t crcdb_open(char *pathname);
+
+/**
+ * Open a database file given an alterate object database pointer.
+ *
+ * The default database can be read or written. alternate database
+ * files are read-only databases.  Multiple calls without intervening
+ * calls to crcdb_close for a given argument will result in the same
+ * object being returned each successive time The argument must match
+ * an alternate object database pointer stored by a precding call to
+ * crcdb_init_alts.
+ *
+ * Arguments:
+ *    alt - an alternate object database pointer (which provides the
+ *          pathname).
+ *
+ * Returns:
+ *    the database (NULL indicates the default)
+ */
+extern crcdb_t crcdb_open_alt(struct alternate_object_database *alt);
+
+/**
+ * Lookup a CRC from a database.
+ *
+ * Arguments:
+ *        dbf - the CRC database; NULL for the default database
+ *       sha1 - the key for the lookup (a 20-byte SHA1 digest)
+ *  objcrc32p - a pointer to a uint32_t to store the returned value when
+ *              an entry in the database exists.
+ *
+ * Returns:
+ *   0 if no entry, 1 if there is an existing entry.
+ */
+extern int crcdb_lookup(crcdb_t dbf, const unsigned char *sha1,
+			uint32_t *objcrc32p);
+
+/**
+ * Remove a CRC from a database.
+ *
+ * Arguments:
+ *        dbf - the CRC database; NULL for the default database
+ *       sha1 - the key for the lookup (a 20-byte SHA1 digest)
+ *
+ * Returns:
+ *   0 on success; -1 if the entry did not exist or if an entry
+ *   could not be deleted
+ */
+extern int crcdb_remove(crcdb_t dbf, const unsigned char *sha1);
+
+/**
+ * Process a CRC for a SHA-1 key.
+ *
+ * Arguments:
+ *        dbf - the CRC database; NULL for the default database
+ *       sha1 - the key for the lookup (a 20-byte SHA1 digest)
+ *   objcrc32 - the crc to store.
+ *
+ * Returns:
+ *   0 if this is a new entry; 1 if it is an existing entry, -1 if
+ *   an entry cannot be added ot the database.
+ *
+ * Errors:
+ *   Will call 'die' and exit if there is a hash collision. Will call
+ *   'error' if the value cannot be entered.
+ */
+extern int crcdb_process(crcdb_t dbf, const unsigned char *sha1,
+			 uint32_t objcrc32);
+
+/**
+ * Reorganize a CRC database.
+ *
+ * Arguments:
+ *        dbf - the CRC database; NULL for the default database
+ * Returns:
+ *   0 on success; -1 on failure
+ */
+extern int crcdb_reorganize(crcdb_t dbf);
+
+
+/**
+ * Close a  database file.
+ *
+ * If the same database was opened multiple times, a reference count is
+ * decremented and the the database will not be closed until the count
+ * reaches zero.  Calls to crcdb_open or crcdb_open_alt must be balanced
+ * by calls to crcdb_close or crcdb_close_alt.
+ *
+ * Arguments:
+ *        dbf - the CRC database.
+ */
+extern void crcdb_close(crcdb_t dbf);
+
+/**
+ * Close a database file given an alternate object database pointer.
+ *
+ * If the same database was opened multiple times, a reference count is
+ * decremented and the the database will not be closed until the count
+ * reaches zero.  Calls to crcdb_open or crcdb_open_alt must be balanced
+ * by calls to crcdb_close or crcdb_close_alt.
+ *
+ * Arguments:
+ *       alt - a pointer ot an alternate object database
+ */
+extern void crcdb_close_alt(struct alternate_object_database *alt);
+
+/**
+ * Shutdown the database files.
+ * This will shut down the default database and the cached alternative
+ * databases.  All others should be closed by calling crcb_alt_close
+ * explicitly
+ */
+extern void crcdb_finish(void);
+
+#endif /*CRCDB_H */
diff --git a/gdbm-packdb.c b/gdbm-packdb.c
new file mode 100644
index 0000000..0115f87
--- /dev/null
+++ b/gdbm-packdb.c
@@ -0,0 +1,247 @@
+#include<sys/types.h>
+#include<sys/stat.h>
+#include <sys/param.h>
+#include<stdio.h>
+#include<string.h>
+#include<malloc.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <time.h>
+#include <pthread.h>
+#include <errno.h>
+#include <gdbm.h>
+
+#include "cache.h"
+#include "packdb.h"
+#include "crcdb.h"
+
+static void nsleep() {
+#if _POSIX_C_SOURCE >= 199309L
+	struct timespec ts;
+	ts.tv_sec = 0;
+	ts.tv_nsec = 100000;
+	nanosleep(&ts, NULL);
+#else
+	sleep(1);
+#endif
+}
+
+
+static int initialized = 0;
+
+static GDBM_FILE dbf = NULL;
+char *dbf_name;
+static int dbf_depth = 0;
+
+pthread_mutex_t gdbm_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static void packdb_close_nolock(void);
+
+void packdb_init(void) {
+	char *last;
+	pthread_mutex_lock(&gdbm_mutex);
+	if (initialized) {
+		pthread_mutex_unlock(&gdbm_mutex);
+		return;
+	}
+	dbf_name = get_object_packdb_node();
+	last = rindex(dbf_name, '/');
+	*last = 0;
+	if (!access(dbf_name, R_OK|W_OK|X_OK)) {
+		initialized = 1;
+	}
+	*last = '/';
+	pthread_mutex_unlock(&gdbm_mutex);
+}
+
+int packdb_initialized(void) {
+  return initialized;
+}
+
+static void packdb_open_nolock(void) {
+	if (dbf_depth == 0) {
+	AGAIN_W:
+		dbf = gdbm_open(dbf_name, 0, GDBM_WRCREAT, PERM_GROUP, NULL);
+		if (dbf == NULL && gdbm_errno == GDBM_CANT_BE_WRITER) {
+			nsleep();
+			goto AGAIN_W;
+		}
+	}
+	dbf_depth++;
+}
+
+void packdb_open(void) {
+	pthread_mutex_lock(&gdbm_mutex);
+	packdb_open_nolock();
+	pthread_mutex_unlock(&gdbm_mutex);
+}
+
+
+int packdb_lookup(const unsigned char *sha1, uint32_t *objcrc32p) {
+	datum key;
+	datum ovalue;
+	uint32_t oldcrc;
+	pthread_mutex_lock(&gdbm_mutex);
+
+	if (!initialized) {
+		pthread_mutex_unlock(&gdbm_mutex);
+		return -1;
+	}
+
+	key.dptr = (char *)sha1;
+	key.dsize = 20;
+
+	packdb_open_nolock();
+	if (dbf == NULL) {
+		packdb_close_nolock();
+		pthread_mutex_unlock(&gdbm_mutex);
+		return -1;
+	}
+	ovalue = gdbm_fetch(dbf, key);
+	packdb_close_nolock();
+	pthread_mutex_unlock(&gdbm_mutex);
+
+	if (ovalue.dptr == NULL) return 0;
+	oldcrc = *(uint32_t *)(ovalue.dptr);
+	free(ovalue.dptr);
+	if (objcrc32p) *objcrc32p = (oldcrc);
+	return 1;
+}
+
+int packdb_remove(const unsigned char *sha1) {
+	datum key;
+	int result;
+	pthread_mutex_lock(&gdbm_mutex);
+	if ((!initialized)  || dbf == NULL) {
+		pthread_mutex_unlock(&gdbm_mutex);
+		return -1;
+	}
+	key.dptr = (char *)sha1;
+	key.dsize = 20;
+	packdb_open_nolock();
+	result = gdbm_delete(dbf, key);
+	packdb_close_nolock();
+	pthread_mutex_unlock(&gdbm_mutex);
+	return result;
+}
+
+
+int packdb_process(const unsigned char *sha1, uint32_t objcrc32) {
+	datum key;
+	datum nvalue;
+	datum ovalue;
+	uint32_t newcrc = (objcrc32);
+	uint32_t oldcrc;
+	pthread_mutex_lock(&gdbm_mutex);
+	if ((!initialized) || dbf == NULL) {
+		pthread_mutex_unlock(&gdbm_mutex);
+		return -1;
+	}
+	key.dptr = (char *)sha1;
+	key.dsize = 20;
+
+	nvalue.dptr = (char *)&newcrc;
+	nvalue.dsize = sizeof(uint32_t);
+
+	packdb_open_nolock();
+	ovalue = gdbm_fetch(dbf, key);
+	if (dbf == dbf && ovalue.dptr == NULL) {
+		int status;
+		status = gdbm_store(dbf, key, nvalue, GDBM_INSERT);
+		packdb_close_nolock();
+		pthread_mutex_unlock(&gdbm_mutex);
+		switch (status) {
+		case 0:
+			return 0;
+		case -1:
+		  error("could not enter crc into database - key = %s",
+		      sha1_to_hex(sha1));
+		      return -1;
+		case 1:
+			return 1;
+		}
+		return -1;	/* should not occur */
+	} else if (ovalue.dptr == NULL) {
+		packdb_close_nolock();
+		pthread_mutex_unlock(&gdbm_mutex);
+		return 0;
+	} else {
+		packdb_close_nolock();
+		pthread_mutex_unlock(&gdbm_mutex);
+
+		oldcrc = *(uint32_t *)ovalue.dptr;
+		free(ovalue.dptr);
+		/*
+		 * Both oldcrc and newcrc are in network byte order.
+		 */
+		if (oldcrc != newcrc) {
+			die("SHA1  COLLISION WHEN INSERTING OBJECT %s",
+			    sha1_to_hex(sha1));
+			return -1;
+		}
+		return 1;
+	}
+}
+
+int packdb_store(unsigned char *sha1) {
+	int status;
+	uint32_t objcrc32;
+	status = crcdb_lookup(NULL, sha1, &objcrc32);
+	if (status == 1) {
+		return packdb_process(sha1, objcrc32);
+	} else if (status == 0) {
+	  return packdb_lookup(sha1, &objcrc32)? 1: -1;
+	} else {
+	  return -1;
+	}
+}
+
+int packdb_reorganize() {
+	int status;
+	pthread_mutex_lock(&gdbm_mutex);
+	if ((!initialized)  || dbf == NULL) {
+		pthread_mutex_unlock(&gdbm_mutex);
+		return -1;
+	}
+	packdb_open_nolock();
+	status = gdbm_reorganize(dbf);
+	packdb_close_nolock();
+	pthread_mutex_unlock(&gdbm_mutex);
+	return status;
+}
+
+
+static void packdb_close_nolock(void) {
+	  if (!initialized) {
+		return;
+	  }
+	  dbf_depth--;
+	  if (dbf_depth == 0 && dbf != NULL) {
+		gdbm_close(dbf);
+		dbf = NULL;
+	  }
+	  if (dbf_depth < 0) {
+		die("packdb dbf_depth %d < 0", dbf_depth);
+	  }
+	  return;
+}
+
+void packdb_close(void) {
+	  pthread_mutex_lock(&gdbm_mutex);
+	  packdb_close_nolock();
+	  pthread_mutex_unlock(&gdbm_mutex);
+}
+
+void packdb_finish(void) {
+	pthread_mutex_lock(&gdbm_mutex);
+	if (!initialized) {
+		pthread_mutex_unlock(&gdbm_mutex);
+		return;
+	}
+	if (dbf != NULL) gdbm_close(dbf);
+	dbf = NULL;
+	dbf_depth = 0;
+	initialized = 0;
+	pthread_mutex_unlock(&gdbm_mutex);
+}
diff --git a/objd-crcdb.c b/objd-crcdb.c
new file mode 100644
index 0000000..2bf6fd9
--- /dev/null
+++ b/objd-crcdb.c
@@ -0,0 +1,324 @@
+#include<sys/types.h>
+#include "cache.h"
+#include "crcdb.h"
+
+struct objd_crcdb {
+  char *root;
+};
+
+static struct objd_crcdb db;
+
+static crcdb_t no_dbf = (crcdb_t) 4;
+
+static crcdb_t dbf = NULL;
+
+#define ALT_DBF_LIMIT  512
+
+
+struct alt_map {
+	struct objd_crcdb db;
+	struct alternate_object_database *alt;
+	struct alt_map *refer;
+};
+
+struct alt_map alt_map[ALT_DBF_LIMIT];
+static int alt_in_use = 0;
+static int initialized = 0;
+
+
+void crcdb_init(void) {
+	if (initialized) {
+		return;
+	}
+	dbf = &db;
+	db.root = get_object_crc_node();
+	initialized = 1;
+}
+
+int crcdb_initialized(void) {
+	return initialized;
+}
+
+static int setup_alt(struct alternate_object_database *alt, void *param) {
+	static char buffer[PATH_MAX];
+	int i;
+	int lim = alt->name - alt->base;
+	memcpy(buffer, alt->base, lim);
+	memcpy(buffer, alt->base, lim);
+	memcpy(buffer+lim, "crcs", 4);
+	buffer[lim+4] = 0;
+	for (i = 0; i < alt_in_use; i++) {
+		if (alt_map[i].alt == alt) {
+			/* don't put in the same entry twice */
+			return 0;
+		}
+		if (strcmp(buffer, alt_map[i].db.root) == 0) {
+			break;
+		}
+	}
+	alt_map[alt_in_use].db.root = xstrdup(buffer);
+	alt_map[alt_in_use].alt = alt;
+	if (i < alt_in_use) {
+		alt_map[alt_in_use].refer = alt_map + i;
+	} else {
+		alt_map[alt_in_use].refer = NULL;
+	}
+	alt_in_use++;
+	return 0;
+}
+
+static int alt_initialized = 0;
+
+void crcdb_init_alts(void){
+	if (alt_initialized) return;
+	foreach_alt_odb(setup_alt, NULL);
+	alt_initialized = 1;
+}
+
+
+crcdb_t crcdb_open(char *name) {
+	int i;
+	if (name == NULL) return NULL;
+	for (i = 0; i < alt_in_use; i++) {
+		if (strcmp(alt_map[i].db.root, name) == 0) {
+			if (alt_map[i].refer) {
+				i = (alt_map[i].refer - alt_map);
+			}
+			return (crcdb_t)&(alt_map[i].db);
+		}
+	}
+	return no_dbf;
+}
+
+crcdb_t crcdb_open_alt(struct alternate_object_database *alt) {
+	int i;
+	for (i = 0; i < alt_in_use; i++) {
+		if (alt_map[i].alt == alt) {
+			return (crcdb_t)&(alt_map[i].db);
+		}
+	}
+	return no_dbf;
+
+}
+/* copied from sha1_file.c */
+static void fill_sha1_path(char *pathbuf, const unsigned char *sha1)
+{
+	int i;
+	for (i = 0; i < 20; i++) {
+		static char hex[] = "0123456789abcdef";
+		unsigned int val = sha1[i];
+		char *pos = pathbuf + i*2 + (i > 0);
+		*pos++ = hex[val >> 4];
+		*pos = hex[val & 0xf];
+	}
+}
+
+/*
+ * Warning: returns a static buffer so be careful about threading.
+ */
+static char *crc32_file_name(const char *path, const unsigned char *sha1)
+{
+	static char buf[PATH_MAX];
+	const char *objcrcdir;
+	int len;
+
+	objcrcdir = path;
+	len = strlen(objcrcdir);
+
+	/* '/' + sha1(2) + '/' + sha1(38) + '\0' */
+	if (len + 43 > PATH_MAX)
+		die("insanely long object crc directory %s", objcrcdir);
+	memcpy(buf, objcrcdir, len);
+	buf[len] = '/';
+	buf[len+3] = '/';
+	buf[len+42] = '\0';
+	fill_sha1_path(buf + len + 1, sha1);
+	return buf;
+}
+
+static int crcdb_lookup_aux(char *path, uint32_t *objcrc32p)
+{
+	if (!access(path, F_OK)) {
+		if (objcrc32p) {
+			int fd = open(path, O_RDONLY);
+			if (fd < 0) {
+				return 0;
+			}
+			if(read_in_full(fd, objcrc32p, sizeof(uint32_t))
+			   != sizeof (uint32_t)) {
+				close(fd);
+				return 0;
+			}
+			close(fd);
+			*objcrc32p = (*objcrc32p);
+		}
+		return 1;
+	} else {
+		return 0;
+	}
+}
+
+
+int crcdb_lookup(crcdb_t gdbf, const unsigned char *sha1, uint32_t *objcrc32p) {
+	char *path;
+
+	if (!initialized || gdbf == no_dbf) {
+	  return -1;
+	}
+	if (gdbf == NULL) gdbf = dbf;
+
+	path = crc32_file_name(gdbf->root, sha1);
+	return crcdb_lookup_aux(path, objcrc32p);
+}
+
+int crcdb_remove(crcdb_t gdbf, const unsigned char *sha1) {
+	char *path;
+	if (!initialized || gdbf == no_dbf) {
+	  return -1;
+	}
+
+	if (gdbf == NULL) {
+		gdbf = dbf;
+	} else {
+		return -1;
+	}
+	path = crc32_file_name(gdbf->root, sha1);
+	return unlink(path);
+}
+
+/* copied from sha1_file.c */
+/* Size of directory component, including the ending '/' */
+static inline int directory_size(const char *filename)
+{
+	const char *s = strrchr(filename, '/');
+	if (!s)
+		return 0;
+	return s - filename + 1;
+}
+
+
+/* copied from sha1_file.c */
+static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
+{
+	int fd, dirlen = directory_size(filename);
+
+	if (dirlen + 20 > bufsiz) {
+		errno = ENAMETOOLONG;
+		return -1;
+	}
+	memcpy(buffer, filename, dirlen);
+	strcpy(buffer + dirlen, "tmp_obj_XXXXXX");
+	fd = git_mkstemp_mode(buffer, 0444);
+	if (fd < 0 && dirlen && errno == ENOENT) {
+		/* Make sure the directory exists */
+		memcpy(buffer, filename, dirlen);
+		buffer[dirlen-1] = 0;
+		if (mkdir(buffer, 0777) || adjust_shared_perm(buffer))
+			return -1;
+
+		/* Try again */
+		strcpy(buffer + dirlen - 1, "/tmp_obj_XXXXXX");
+		fd = git_mkstemp_mode(buffer, 0444);
+	}
+	return fd;
+}
+
+/* copied from sha1_file.c */
+static int write_buffer(int fd, const void *buf, size_t len)
+{
+	if (write_in_full(fd, buf, len) < 0)
+		return error("file write error (%s)", strerror(errno));
+	return 0;
+}
+
+/* copied from sha1_file.c */
+/* Finalize a file on disk, and close it. */
+static void close_sha1_file(int fd)
+{
+	if (fsync_object_files)
+		fsync_or_die(fd, "sha1 file");
+	if (close(fd) != 0)
+		die_errno("error when closing sha1 file");
+}
+
+
+int crcdb_process(crcdb_t gdbf, const unsigned char *sha1, uint32_t objcrc32) {
+	uint32_t oldcrc;
+	int has_oldcrc = 0;
+	char *path;
+	if (!initialized || gdbf == no_dbf) {
+	  return -1;
+	}
+	if (gdbf == NULL) gdbf = dbf;
+	path = crc32_file_name(gdbf->root, sha1);
+	has_oldcrc = crcdb_lookup_aux(path, &oldcrc);
+	if (gdbf == dbf && !has_oldcrc) {
+		uint32_t crc;
+		static char ctmpfile[PATH_MAX];
+		int fdc = create_tmpfile(ctmpfile, sizeof(ctmpfile), path);
+		if (fdc < 0) {
+		  return -1;
+		}
+		crc = (objcrc32);
+		if (fdc >= 0 && write_buffer(fdc, &crc, sizeof (crc)) < 0) {
+			close_sha1_file(fdc);
+			return -1;
+		}
+		if (fdc >= 0) {
+			close_sha1_file(fdc);
+			return (move_temp_to_file(ctmpfile, path) == 0)?
+				0: -1;
+		}
+		return -1;
+	} else if (has_oldcrc) {
+		if (oldcrc != objcrc32) {
+			die("SHA1 COLLISION WHEN INSERTING OBJECT %s",
+			    sha1_to_hex(sha1));
+			return -1;
+		}
+		return 1;
+	} else {
+		return 0;
+	}
+}
+
+
+void crcdb_close(crcdb_t gdbf) {
+	return;
+}
+
+void crcdb_close_alt(struct alternate_object_database *alt) {
+	return;
+}
+
+
+
+int crcdb_reorganize(crcdb_t gdbf) {
+	if (!initialized || gdbf == no_dbf) {
+	  return -1;
+	}
+	if (gdbf == NULL) {
+		return 0;
+	} else {
+		return -1;
+	}
+}
+
+
+
+void crcdb_finish(void) {
+	int i;
+	if (!initialized) {
+		return;
+	}
+	dbf->root = NULL;
+
+	for (i = 0; i < alt_in_use; i++) {
+		free(alt_map[i].db.root);
+		alt_map[i].db.root = NULL;
+	}
+	memset(alt_map, 0, sizeof(struct alt_map) *alt_in_use);
+	alt_in_use = 0;
+	initialized = 0;
+	alt_initialized = 0;
+}
diff --git a/packdb.h b/packdb.h
new file mode 100644
index 0000000..c4320ac
--- /dev/null
+++ b/packdb.h
@@ -0,0 +1,107 @@
+#ifndef PACKDB_H
+#define PACKDB_H
+
+#include<stdint.h>
+
+/**
+ *  Initialize the database.
+ *  This opens a database file in the objects directory named crcs,
+ *  used to store CRCS of objects (uncompressed, excluding the header)
+ *  for hash-collision detection.
+ */
+extern void packdb_init(void);
+
+/**
+ * Check if the database has been initialized.
+ * Returns:
+ *   1 if packdb_init has been called; false otherwise.
+ */
+extern int packdb_initialized(void);
+
+/**
+ * Open the persistent database to store a copy of obj CRCs in pack index files.
+ * Nested calls are allowed, but must be balanced by calls to packdb_close.
+ * For nested calls, subsequent ones merely increment a reference count.
+ *
+ * This is used to create space-efficient storage of object CRCs that
+ * are not associated with loose objects (e.g., because they are in pack
+ * files).  Intended for use when building pack files.
+ *
+ * Note:
+ *   Interacting with another process that calls this function on the
+ *   same repository may lead to deadlock unless packdb_close is
+ *   called before that interaction.
+ */
+extern void packdb_open(void);
+
+/**
+ * Store a crc in the persistent database for creating pack index files.
+ *
+ * Arguments:
+ *   sha1 - the key for the entry (a 20-byte sha1 hash)
+ *   crc - the crc to store (the crc of an object's data)
+ * Returns:
+ *   0 if we added a new entry, 1 if the entry already exists, -1 on error
+ */
+extern int packdb_process(const unsigned char *sha1, uint32_t objcrc32);
+
+/**
+ * Lookup a CRC from a database.
+ *
+ * Arguments:
+ *        dbf - the CRC database; NULL for the default database
+ *       sha1 - the key for the lookup (a 20-byte SHA1 digest)
+ *  objcrc32p - a pointer to a uint32_t to store the returned value when
+ *              an entry in the database exists.
+ * Returns:
+ *   0 if no entry, 1 if there is an existing entry.
+ */
+extern int packdb_lookup(const unsigned char *sha1, uint32_t *objcrc32p);
+
+/**
+ * Moves a crc into the persistent database for creating pack index files.
+ * This will delete the entry from the 'loose-object' crc database.
+ *
+ * Arguments:
+ *   sha1 - the key for the entry (a 20-byte sha1 hash)
+ * Returns:
+ *   0 if we stored an entry in the crcdb database, 1 if the entry already
+ *     existed in the packdb database, -1 on error or if there was no entry
+ *     to store.
+ */
+extern int packdb_store(unsigned char *sha1);
+
+
+/**
+ * Remove a CRC from a database.
+ *
+ * Arguments:
+ *        dbf - the CRC database; NULL for the default database
+ *       sha1 - the key for the lookup (a 20-byte SHA1 digest)
+ *
+ * Returns:
+ *   0 on success; -1 if the entry did not exist or if an entry
+ *   could not be deleted
+ */
+extern int packdb_remove(const unsigned char *sha1);
+
+
+/**
+ * Reorganize the database.
+ * Returns:
+ *   0 on success; -1 on failure
+ */
+extern int packdb_reorganize(void);
+
+/**
+ * Close the database file.
+ */
+extern void packdb_close(void);
+
+/**
+ * Close the database if opened and uninitialize the module.
+ * This is intended to be called when the module is no longer needed.
+ */
+extern void packdb_finish(void);
+
+#endif
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] Documentation update for 'git branch --list'
From: Vincent van Ravesteijn @ 2011-11-30  5:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git
In-Reply-To: <7vobw4doey.fsf@alter.siamese.dyndns.org>

Op 22-11-2011 19:04, Junio C Hamano schreef:
> Vincent van Ravesteijn<vfr@lyx.org>  writes:
>
>> Op 21-11-2011 18:37, Junio C Hamano schreef:
>> ...
>>> It is natural to expect "git branch --merged pu vr/\*" to list branches
>>> that are contained in 'pu' whose names match the given pattern, but it
>>> seems to try creating a branch called "vr/*" and fails, for example.
>> If this is what you naturally would expect, I would expect the
>> following "git branch vr/*" to work as well.
>> What would you say if we try to interpret the argument as a pattern
>> when the argument is not a valid ref name?
> We don't, as that is inviting mistakes. "git branch vr/*" if you have a
> vr/ directory in your working tree may create vr/a branch from where the
> tip of vr/b points at by mistake.
>
> The "--merged" option is an explicit clue that the user is not interested
> in creating new branch, and the string being a pattern is additional clue.
> The "--list" option was recently added for the explicit purpose of giving
> such a clue as safety measure.

Well, that was the answer that I foresaw.

I will compose a patch implementing an at least consistent behaviour.

Vincent

^ permalink raw reply

* Re: [PATCH 11/13] strbuf: add strbuf_add*_urlencode
From: René Scharfe @ 2011-11-30  5:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git
In-Reply-To: <20111130032028.GA24704@sigill.intra.peff.net>

Am 30.11.2011 04:20, schrieb Jeff King:
> On Wed, Nov 30, 2011 at 12:26:20AM +0100, René Scharfe wrote:
> 
>>>>> +static int is_rfc3986_reserved(char ch)
>>>>> +{
>>>>> +	switch (ch) {
>>>>> +	case '!': case '*': case '\'': case '(': case ')': case ';':
>>>>> +	case ':': case '@': case '&': case '=': case '+': case '$':
>>>>> +	case ',': case '/': case '?': case '#': case '[': case ']':
>>>>> +		return 1;
>>>>> +	}
>> [...]
>> Sorry for my bikeshedding, but I'd paint it like this:
>>
>> 	return !!strchr("!*'();:@&=+$,/?#[]", ch);
> 
> I was always under the impression that computed jumps via "switch" would
> out-perform even an optimized strchr. Of course, I never tested. And I
> doubt performance is even relevant here, and I admit I don't care overly
> much. I find them both equally readable.
> 
> I'm going to leave it as-is unless somebody else wants to say "I
> strongly prefer version X".

Sure, the second one is significantly slower than the first one.  I just
prefer it based one its looks in case performance doesn't matter, but
that's probably just me being (sometimes too) fond of terseness. :)

René

^ permalink raw reply

* Re: [PATCH 11/13] strbuf: add strbuf_add*_urlencode
From: Junio C Hamano @ 2011-11-30  5:40 UTC (permalink / raw)
  To: Jeff King; +Cc: René Scharfe, git
In-Reply-To: <20111130032028.GA24704@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> On Wed, Nov 30, 2011 at 12:26:20AM +0100, René Scharfe wrote:
>
>> >>> +static int is_rfc3986_reserved(char ch)
>> >>> +{
>> >>> +	switch (ch) {
>> >>> +	case '!': case '*': case '\'': case '(': case ')': case ';':
>> >>> +	case ':': case '@': case '&': case '=': case '+': case '$':
>> >>> +	case ',': case '/': case '?': case '#': case '[': case ']':
>> >>> +		return 1;
>> >>> +	}
>> [...]
>> Sorry for my bikeshedding, but I'd paint it like this:
>> 
>> 	return !!strchr("!*'();:@&=+$,/?#[]", ch);
>
> I was always under the impression that computed jumps via "switch" would
> out-perform even an optimized strchr. Of course, I never tested. And I
> doubt performance is even relevant here, and I admit I don't care overly
> much. I find them both equally readable.
>
> I'm going to leave it as-is unless somebody else wants to say "I
> strongly prefer version X".

I find the switch/case one much easier to read and count, especially since
all the choices are essentially line-noise characters.

Just make sure you indent it correctly ;-)

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Bill Zaumen @ 2011-11-30  4:01 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, gitster, pclouds, peff, torvalds
In-Reply-To: <CAJo=hJtFT55Ucyij9esr3Hd9yJ6XCxatK7vjPOLMKow57HqBoQ@mail.gmail.com>

Note: for some reason my email is not showing up on the mailing list.
I'm trying a different email address - previously my 'From' field
contained a subaddress "+git" but gmail won't put that in the 'Sender'
field, so possibly the email is being filtered for that reason.

On Tue, 2011-11-29 at 09:08 -0800, Shawn Pearce wrote:

> I don't think you understand how these thin packs are processed.

I think the confusion was due to me being a bit too terse.  The
documentation clearly states that thin packs allow deltas to be
sent when the delta is based on an object that the server and client
both have in common, given the commits each already has.  If there is
one server and one client, there isn't an issue.  The case I meant is
the one in which a user does a fetch from one server, gets a forged
blob, and then fetches from another server with the original blob, and
with additional commits along the same branch. If a server bases the
delta off of the original blob, and the client applies the delta to the
forged blob, the client will most likely end up with a blob with a
different SHA-1 hash than the one expected.  Since an object in a tree
is then missing (no object with the expected SHA-1 hash), the repository
is corrupted.

The "first to arrive wins" policy isn't sufficient in one specific case:
multiple remote repositories where new commits are added asynchronously,
with the repositories out of sync possibly for days at a time (e.g.,
over a 3-day weekend).  In this case, the first to arrive at one
repository may not be the first to arrive at another, so what happens at
a particular client in the presence of hash collisions is dependent on
the sequence of remotes from which updates were fetched.  The risk
occurs in the window where the repositories are out of sync.

Regarding the kernel.org problem that you used as a separate example,
while it was fortunately possible to rebuild things (and git provided
significant advantages), earlier detection of the problem might have
reduced the time for which kernel.org was down.  Early detection of
errors in general is a good practice if it can be done at a reasonable
cost.

> Trust. Review. Verify.

While good advice in principle, you should keep in mind that there are
a lot of people out there working at various companies who are not as
capable as you are.  Some of them are overworked and make mistakes
because they've been working 16 hour days for weeks trying to meet a
deadline. Given that, extra checks to catch problems early
are probably a good idea if they don't impact performance significantly.

^ permalink raw reply

* Re: [PATCH 11/13] strbuf: add strbuf_add*_urlencode
From: Jeff King @ 2011-11-30  3:20 UTC (permalink / raw)
  To: René Scharfe; +Cc: Junio C Hamano, git
In-Reply-To: <4ED56A1C.9050800@lsrfire.ath.cx>

On Wed, Nov 30, 2011 at 12:26:20AM +0100, René Scharfe wrote:

> >>> +static int is_rfc3986_reserved(char ch)
> >>> +{
> >>> +	switch (ch) {
> >>> +	case '!': case '*': case '\'': case '(': case ')': case ';':
> >>> +	case ':': case '@': case '&': case '=': case '+': case '$':
> >>> +	case ',': case '/': case '?': case '#': case '[': case ']':
> >>> +		return 1;
> >>> +	}
> [...]
> Sorry for my bikeshedding, but I'd paint it like this:
> 
> 	return !!strchr("!*'();:@&=+$,/?#[]", ch);

I was always under the impression that computed jumps via "switch" would
out-perform even an optimized strchr. Of course, I never tested. And I
doubt performance is even relevant here, and I admit I don't care overly
much. I find them both equally readable.

I'm going to leave it as-is unless somebody else wants to say "I
strongly prefer version X".

-Peff

^ permalink raw reply

* Auto update submodules after merge and reset
From: Max Krasnyansky @ 2011-11-30  0:55 UTC (permalink / raw)
  To: git

Does anyone have a pointer to a thread/discussion that explains why git 
submodules are not auto
updated when the superproject is updated (merge, reset, etc) by default?

Assuming a simple and default setup where submodule update policy is set 
to "checkout".
It seems that the default and sane behavior should be to update 
(checkout) corresponding submodule
commit to track the superproject.
I can't seem to find convincing explanation why it's not the case :). 
Having to manually update
submodules after pull or reset has been error prone and confusing for 
the devs I work with.

I'm thinking about adding a config option that would enable automatic 
submodule update but wanted
to see if there is some fundamental reason why it would not be accepted.

Thanx
Max

^ permalink raw reply

* Re: [PATCH] git-gui: Set both 16x16 and 32x32 icons on X to pacify Xming.
From: Samuel Bronson @ 2011-11-30  0:02 UTC (permalink / raw)
  To: git; +Cc: Pat Thoyts, Shawn O. Pearce, Samuel Bronson
In-Reply-To: <1321640015-6663-1-git-send-email-naesten@gmail.com>

On Fri, Nov 18, 2011 at 1:13 PM, Samuel Bronson <naesten@gmail.com> wrote:
> It would be better if the 32x32 icon was equivalent to the one used on
> Windows (in git-gui.ico), but I'm not sure how that would best be done,
> so I copied this code from gitk instead.
> ---
>  git-gui.sh |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/git-gui.sh b/git-gui.sh
> index c190cbe..9d01039 100755
> --- a/git-gui.sh
> +++ b/git-gui.sh
> @@ -729,7 +729,12 @@ if {[is_Windows]} {
>                gitlogo put gray26  -to  5 15 11 16
>                gitlogo redither
>
> -               wm iconphoto . -default gitlogo
> +               # TODO: should use something equivalent to the 32x32 image in
> +               # the .ico file
> +               image create photo gitlogo32    -width 32 -height 32
> +               gitlogo32 copy gitlogo -zoom 2 2
> +
> +               wm iconphoto . -default gitlogo gitlogo32
>        }
>  }

Hmm. Nothing seems to have happened with this patch yet. Any
suggestions on how to bring it to the attention of the git-gui people?

^ permalink raw reply

* Re: [PATCH 11/13] strbuf: add strbuf_add*_urlencode
From: René Scharfe @ 2011-11-29 23:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git
In-Reply-To: <20111129211950.GD1793@sigill.intra.peff.net>

Am 29.11.2011 22:19, schrieb Jeff King:
> On Tue, Nov 29, 2011 at 10:19:00AM -0800, Junio C Hamano wrote:
> 
>>> +static int is_rfc3986_reserved(char ch)
>>> +{
>>> +	switch (ch) {
>>> +	case '!': case '*': case '\'': case '(': case ')': case ';':
>>> +	case ':': case '@': case '&': case '=': case '+': case '$':
>>> +	case ',': case '/': case '?': case '#': case '[': case ']':
>>> +		return 1;
>>> +	}
>>> +	return 0;
>>> +}
>>
>> Part of me wonders if we still have extra bits in sane_ctype[] array but
>> that one is cumbersome to update, and the above should be easier to read
>> and maintain.
> 
> We have 2 bits left. I did consider it, but it just seemed excessively
> cumbersome for something that really doesn't need to be that fast (if it
> is indeed any faster than this case statement).

Sorry for my bikeshedding, but I'd paint it like this:

	return !!strchr("!*'();:@&=+$,/?#[]", ch);

René

^ permalink raw reply

* Re: [PATCH] gitweb: Call to_utf8() on input string in chop_and_escape_str()
From: Jürgen Kreileder @ 2011-11-29 22:14 UTC (permalink / raw)
  To: Jakub Narebski, git
In-Reply-To: <201111292250.04800.jnareb@gmail.com>

On Tue, Nov 29, 2011 at 22:50, Jakub Narebski <jnareb@gmail.com> wrote:
> Jürgen Kreileder wrote:
>
>> a) To fix the comparison with the chopped string
>> b) To give the title attribute correct encoding
>>
>> Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
>> ---
>>  gitweb/gitweb.perl |    4 ++--
>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 4f0c3bd..4237ea6 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -1695,11 +1695,11 @@ sub chop_and_escape_str {
>>       my ($str) = @_;
>
> Why not simply
>
>        my $str = to_utf8(shift);
>

Good question.  Because I thought it broke something when I tested it.
I can't reproduce that now, though.  Might have been something unrelated.
So:

-- >8 --
Subject: [PATCH] gitweb: Call to_utf8() on input string in
 chop_and_escape_str()

a) To fix the comparison with the chopped string
b) To give the title attribute correct encoding

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
---
 gitweb/gitweb.perl |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index bfada0e..036ae46 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1699,6 +1699,7 @@ sub chop_and_escape_str {
 	my ($str) = @_;

 	my $chopped = chop_str(@_);
+	$str = to_utf8($str);
 	if ($chopped eq $str) {
 		return esc_html($chopped);
 	} else {
-- 
1.7.5.4

^ permalink raw reply related

* Re: Re: [RFC/PATCH] add update to branch support for "floating submodules"
From: Heiko Voigt @ 2011-11-29 22:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vr51htbsy.fsf@alter.siamese.dyndns.org>

Hi,

On Wed, Nov 09, 2011 at 10:01:33AM -0800, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> 
> > This is almost ready but I would like to know what users of the
> > "floating submodule" think about this.
> 
> Thanks for working on this.
> 
> I do like to hear from potential users as well, because the general
> impression we got was that floating submodules is not a real need of
> anybody, but it is merely an inertia of people who (perhaps mistakenly)
> thought svn externals that are not anchored to a particular revision is a
> feature when it is just a limitation in reality. During the GitTogether'11
> we learned that Android that uses floating model does not really have to.

Since we did not get any reply from potential floating submodule users I
do not mind to drop this patch for now. It is archived in the mailing list
and it should be easy to revive once there is real world need for it.

Once we have the "exact" model support for checkout and friends this
might be a handy tool to update submodules before releases and such. But
currently I would like to focus on the "exact" front first.

Cheers Heiko

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Jeff King @ 2011-11-29 22:05 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Bill Zaumen, git, gitster, pclouds, torvalds
In-Reply-To: <CAJo=hJtFT55Ucyij9esr3Hd9yJ6XCxatK7vjPOLMKow57HqBoQ@mail.gmail.com>

On Tue, Nov 29, 2011 at 09:08:27AM -0800, Shawn O. Pearce wrote:

> As Peff pointed out elsewhere in this thread, the odds of a SHA-1
> collision in a project are low, on the order of 1/(2^80).

Minor nit: it's actually way less than that. You have to do on the order
of 2^80 operations to get a 50% chance of a collision. But that's not
the probability for a collision given a particular number of
operations[1].

The probability for a SHA-1 collision on 10 million hashes (where
linux-2.6 will be in a decade or two) is about 1/(2^115).

That doesn't change the validity of any of your points, of course. 1 in
2^80 and 1 in 2^115 are both in the range of "impossibly small enough
not to care about".

To continue our astronomy analogies, NASA estimates[2] the impact
probability of most tracked asteroids in the 10^6 range (around 2^20).
So getting a collision in linux-2.6 in the next decade has roughly the
same odds as the Earth being hit by 5 or 6 large asteroids.

-Peff

[1] http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

[2] http://neo.jpl.nasa.gov/risk/

^ permalink raw reply

* Re: Re: Git Submodule Problem - Bug?
From: Heiko Voigt @ 2011-11-29 22:03 UTC (permalink / raw)
  To: Jens Lehmann; +Cc: Manuel Koller, Fredrik Gustafsson, Thomas Rast, git
In-Reply-To: <4ED5196B.5030200@web.de>

Hi,

On Tue, Nov 29, 2011 at 06:42:03PM +0100, Jens Lehmann wrote:
> Am 29.11.2011 11:41, schrieb Fredrik Gustafsson:
> > On Tue, Nov 29, 2011 at 11:25:41AM +0100, Thomas Rast wrote:
> >> So maybe the right questions to ask would be: what's the *official*
> >> way of removing a submodule completely?  Do we support overwriting
> >> submodules in the way Manuel wanted to?  Why not? :-)
> > 
> > I suggest that we add a command for that;
> > git submodule remove <submodule>
> 
> Hmm, to me it looks like the problem is in "git submodule add". It
> doesn't check if the submodule repo it finds in .git/modules matches
> the one the user wants to create. So we end up reviving the first
> submodule although the user wants to use a completely different repo.
> 
> One solution could be to only let "git submodule update" revive
> submodules from .git/modules and make "git submodule add" error out
> if it finds the git directory of a submodule with the same name in
> .git/modules. But currently there is no way to tell "git submodule add"
> to use a different submodule name (it always uses the path as a name),
> so we might have to add an option to do that and tell the user in the
> error message how he can add a different submodule under the same
> path.

I think this is the way to go. We teached submodule add to revive a
local submodule. Further thinking about it this is probably not what the
users wants in most cases. For update its the right thing but for add we
should probably tell the user that there is already a local submodule in
the way and give him the option to take it or that he should remove it.

> Another solution could be that "git submodule add" detects that a
> submodule with the name "sub" did exist and chooses a different name
> (say "sub2") for the the new one. Then the user wouldn't have to
> cope with the problem himself.

In my opinion this is too much automatism. We could prompt for a new
name to support the user but I do not think this mechanism should be
automatic.

How about this:

The user issues 'git submodule add foo' and we discover that there is
already a local clone under the name foo. Git then asks something like
this

	Error when adding: There is already a local submodule under the
	name 'foo'.

	You can either rename the submodule to be added to a different
	name or manually remove the local clone underneath
	.git/modules/foo. If you want to remove the local clone please
	quit now.

	We strongly suggest that you give each submodule a unique name.
	Note: This name is independent from the path it is bound to.

	What do you want me to do ([r]ename it, [Q]uit) ?

When the user chooses 'rename' git will prompt for a new name.

If we are going to support the remove use case with add we additionally
need some logic to deal with it during update (which is not supported
yet AFAIK). But we probably need this support anyway since between
removal and adding a new submodule under the same can be a long time.
If users switch between such ancient history and the new history we
would have the same conflict.

We could of course just error out and tell the user that he has to give
the submodule an uniqe name. If the user does not do so leave it to him
to deal with the situation manually.

What do you think?

Cheers Heiko

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Bill Zaumen @ 2011-11-29 21:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds
In-Reply-To: <20111129090733.GA22046@sigill.intra.peff.net>

Thanks for mentioning the 100K limit, which I didn't know about.  Will
have to try to see how to split it into two patches.

The intent is to increase the cost of a malicious attack, which requires
generating two different files with the same SHA-1 value, detect such an
attack early, and to slow such an attack down - because of Git's rule
that the first object with a SHA-1 value is the one the repository has,
if it takes longer to generate a collision than the time it takes to get
the original object into all repositories (which is done manually by
multiple individuals), the forged file will never appear in any
"official" repository.

The additional CRC (easily changed to whatever message digest one might
prefer) makes a malicious attack far more difficult: the modified file
has to have both the same SHA-1 hash (including the Git header) and 
the same CRC (not including the Git header).  An efficient algorithm to
do both simultaneously does not yet exist.  So, if we could generate a
SHA-1 collision in one second, it would presumably take billions of
seconds (many decades of continuous computation) to generate a SHA-1
hash with the same CRC, and well before a year has elapsed, the original
object should have been in all the repositories, preventing a forged
object from being inserted. Of course, eventually you might need a
real message digest.

The weakness of a CRC as an integrity check is not an issue since it is
never used alone: it's use is more analogous to the few extra bits added
to a data stream when error-detecting codes are used.  I used a CRC in
the initial implementation rather than a message digest because it is
faster, and because the initial goal was to get things to work
correctly.  In any case, the patch does not eliminate any code in which
Git already does a byte-by-byte comparison.  In cases where Git
currently assumes that two objects are the same because the SHA-1 hashes
are the same, the patch compares CRCs as an additional test.

Regarding your [Jeff's] second concern, "how does this alternative
digest have any authority?" there are two things to keep in mind. First,
it is a supplement to the existing digest.  Second, any value of the CRC
that is stored permanently (baring bugs, in my implementation, of
course) is computed locally - when a loose object is created or when a
pack file's index is created.  At no point is a CRC that was obtained
from another repository trusted. While the patch modifies Git so that it
can send CRCs when using the git protocol, these CRCs are never stored,
but are instead used only for cross checks.  If one side or the other
"lies", you get an error.  

To give a concrete example, during a fetch, the git protocol currently
sends "have" messages that contain the SHA-1 hashes of commits.  The
extension allows two CRCs to be sent along with each hash.  If these do
not match the local values (tested only if the local values exist),
something is wrong and you get an error report that the server sends to
the client, but the server never uses these CRCs for any other purpose
and the server never sends its CRCs to the client because of
backwards-compatibility issues. For objects that are transferred, you
end up with a pack file, with index-pack called to build the index (and
with the patch, the corresponding MDS file), but index-pack already does
a byte-by-byte comparison to detect collisions - the comparison is much
faster than the SHA-1 computation index-pack has to do anyway.

Where this helps is when one is using multiple repositories. If you
fetch a commit from repository B, which we'll assume has a forged blob
(different content, but the original SHA-1 hash), and then run fetch
using repository A, which has has the same commit with the original
blob, the forged blob will not be transferred from Server A and the
client will not be notified that there is an inconsistency - the
protocol is "smart" enough to know that the client already has the
commit and assumes there is nothing to do regarding it.

BTW, regarding your [Jeff's] discussion about putting an additional
header in commit messages - I tried that.  The existing versions of
Git didn't like it: barring a bug in my test code, it seems that Git
expects headers in commit messages to be in a particular order and
treats deviations from that to be an error.  I even tried appending
blank lines at the end of a commit, with spaces and tabs encoding an
additional CRC, and that didn't work either - at least it never got
through all the test programs, failing in places like the tests
involving notes. In any case, you'd have to phase in such a change
gradually, first putting in the code to read the new header if it is
there, and subsequently (after ample time so that everyone is running
a sufficiently new version) enabling the code to create the new
header.

Also, regarding "At that point, I really wonder if a flag day to switch
to a new repository format is all that bad," if that turns out to be
the decision, I'd recommend doing it sooner rather than later. The
reason is cost, which grows with the number of git users and the
number and size of Git repositories.

Bill

^ permalink raw reply

* Re: [PATCH] gitweb: Call to_utf8() on input string in chop_and_escape_str()
From: Jakub Narebski @ 2011-11-29 21:50 UTC (permalink / raw)
  To: Jürgen Kreileder; +Cc: git
In-Reply-To: <CAKD0Uuy8y7Dc6gfvYVe-FJ=Reiu0M3wOY4r4VVPtEYmahZcdwA@mail.gmail.com>

Jürgen Kreileder wrote:

> a) To fix the comparison with the chopped string
> b) To give the title attribute correct encoding
> 
> Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
> ---
>  gitweb/gitweb.perl |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 4f0c3bd..4237ea6 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -1695,11 +1695,11 @@ sub chop_and_escape_str {
>  	my ($str) = @_;

Why not simply

  	my $str = to_utf8(shift);
 
>  	my $chopped = chop_str(@_);
> -	if ($chopped eq $str) {
> +	if ($chopped eq to_utf8($str)) {
>  		return esc_html($chopped);
>  	} else {
>  		$str =~ s/[[:cntrl:]]/?/g;
> -		return $cgi->span({-title=>$str}, esc_html($chopped));
> +		return $cgi->span({-title => to_utf8($str)}, esc_html($chopped));
>  	}
>  }
> 
> -- 
> 1.7.5.4
> 

-- 
Jakub Narebski
Poland

^ permalink raw reply

* [PATCH] gitweb: esc_html() site name for title in OPML
From: Jürgen Kreileder @ 2011-11-29 21:45 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

This escapes the site name in OPML.  Also fixes encoding issues
because esc_html() uses to_utf().

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
---
 gitweb/gitweb.perl |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 4f0c3bd..df747c1 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -7699,11 +7699,12 @@ sub git_opml {
 		-charset => 'utf-8',
 		-content_disposition => 'inline; filename="opml.xml"');

+	my $title = esc_html($site_name);
 	print <<XML;
 <?xml version="1.0" encoding="utf-8"?>
 <opml version="1.0">
 <head>
-  <title>$site_name OPML Export</title>
+  <title>$title OPML Export</title>
 </head>
 <body>
 <outline text="git RSS feeds">
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH] gitweb: Call to_utf8() on input string in chop_and_escape_str()
From: Jürgen Kreileder @ 2011-11-29 21:41 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

a) To fix the comparison with the chopped string
b) To give the title attribute correct encoding

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
---
 gitweb/gitweb.perl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 4f0c3bd..4237ea6 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1695,11 +1695,11 @@ sub chop_and_escape_str {
 	my ($str) = @_;

 	my $chopped = chop_str(@_);
-	if ($chopped eq $str) {
+	if ($chopped eq to_utf8($str)) {
 		return esc_html($chopped);
 	} else {
 		$str =~ s/[[:cntrl:]]/?/g;
-		return $cgi->span({-title=>$str}, esc_html($chopped));
+		return $cgi->span({-title => to_utf8($str)}, esc_html($chopped));
 	}
 }

-- 
1.7.5.4

^ permalink raw reply related

* Re: [PATCH 12/13] credentials: add "store" helper
From: Jeff King @ 2011-11-29 21:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vsjl6ssf9.fsf@alter.siamese.dyndns.org>

On Tue, Nov 29, 2011 at 10:19:06AM -0800, Junio C Hamano wrote:

> > +	while (strbuf_getline(&line, fh, '\n') != EOF) {
> > +		credential_from_url(&entry, line.buf);
> > +		if (entry.username && entry.password &&
> > +		    credential_match(c, &entry)) {
> 
> This looks curious; isn't checking .username and .password part of the
> responsibility of credential_match()? And even if entry lacks password
> (which won't happen in the context of this program, given the
> implementation of store_credential() below) shouldn't it still be
> considered a match?

credential_match will check .username, if the pattern mentions it. It
will never check .password. My intent here was to enforce well-formed
entries in the credential file. So you could add:

  http://example.com/

to the credential file, but it's just meaningless noise. It
doesn't actually tell us a username or password.

The helper won't add such an entry itself, but given the simplicity of
the format, I wanted to leave the door open for curious hackers to
populate it manually if they choose.

I think you're right that:

  http://user@example.com/

is potentially meaningful, and this would skip that. OTOH, you would be
much better served to just do:

  git config credential.http://example.com.username user

So I consider it a slight abuse of this helper in the first place.

> > +static void rewrite_credential_file(const char *fn, struct credential *c,
> > +				    struct strbuf *extra)
> > +{
> > +	umask(077);
> 
> Curious placement of umask(). I would expect a function that has its own
> call to umask() restore it before it returns, and a stand-alone program
> whose sole purpose is to work with a private file, setting a tight umask
> upfront at the beginning of main() may be easier to understand.

I think that is largely a holdover from the original implementation,
which set the umask and did other black magic before calling
git_config_set. I agree it would make more sense at the beginning of the
program. Will change.

> > +	if (hold_lock_file_for_update(&credential_lock, fn, 0) < 0)
> > +		die_errno("unable to get credential storage lock");
> > +	parse_credential_file(fn, c, NULL, print_line);
> > +	if (extra)
> > +		print_line(extra);
> 
> An entry for a newly updated password comes at the end of the file,
> instead of replacing an entry already in the file in-place? Given that
> parse_credential_file() when processing a look-up request (which is the
> majority of the case) stops upon finding a match, it might make more sense
> to have the new one (which may be expected to be used often) at the
> beginning instead, no?

Yeah. It's a linear search. Your worst-case is always going to be O(n),
but I just assumed n would remain relatively small and we wouldn't care
(if it isn't, the right solution is probably a smarter data structure).

But your optimization is trivial to implement, so it's probably worth
doing.

> > +	if (commit_lock_file(&credential_lock) < 0)
> > +		die_errno("unable to commit credential store");
> > +}
> > +
> > +static void store_credential(const char *fn, struct credential *c)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +
> > +	if (!c->protocol || !(c->host || c->path) ||
> > +	    !c->username || !c->password)
> > +		return;
> [...]
> > +static void remove_credential(const char *fn, struct credential *c)
> > +{
> > +	if (!c->protocol || !(c->host || c->path))
> > +		return;
> 
> The choice of the fields looks rather arbitrary. I cannot say "remove all
> the credentials whose username is 'gitster' at 'github.com' no matter what
> protocol is used", but I can say "remove all credentials under any name
> for any host as long as the transfer goes over 'https' and accesses a
> repository at 'if/xyzzy' path", it seems.

It is kind of arbitrary. The storage format is URLs, which is why
store_credential is a little pedantic. We can't store something that
doesn't have a protocol part, as that is a required part of the URL
(actually, in URL-speak this is the "scheme"; I wonder if we should use
the same term here).

I was thinking we need a protocol for the same reason in
remove_credential, but I think you are right. We never actually convert
it to a URL, so in theory you could do:

  git credential-store erase <<\EOF
  username=gitster
  host=github.com
  EOF

Again, not an operation that git will ever perform, but I guess
something that people might want to do (I had always assumed the
"$EDITOR ~/.git-credentials" was going to be the preferred way of doing
such operations :) ).

I don't think there's any harm in loosening that condition.

-Peff

^ permalink raw reply

* Re: [PATCH] gitweb: Don't append ';js=(0|1)' to external links
From: Jürgen Kreileder @ 2011-11-29 21:31 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
In-Reply-To: <m3pqgaloda.fsf@localhost.localdomain>

On Tue, Nov 29, 2011 at 20:28, Jakub Narebski <jnareb@gmail.com> wrote:
> Jürgen Kreileder <jk@blackdown.de> writes:
[...]
> Thanks for this, but I think a better solution would be to explicitly
> mark the few external links we have e.g. with 'class="external"', and
> use that to avoid adding ';js=(0|1)' to them.

This won't work because there are more than a few external links.  Think of
links added in the header or footer or via a project specific README.html.

You would have to do it the other way round: Mark all internal links.

> This has the advantage that we can use different style to mark
> outgoing external links.
>
> I even have such patch somewhere in the StGit stack...
> -- >8 --
> Subject: [PATCH] gitweb: Mark external links
>
> ...and do not add 'js=1' to them with JavaScript.
>
> Both $logo_url and $home_link links are now marked with "external"
> class, and fixLink does not add 'js=1' to them on click.  We add
> 'js=1' to internal link to make server-side of gitweb know that it can
> use JavaScript-only actions; we shouldn't do this for extrenal links,
> as 'js=1' might mean something else to them.
>
> Note that only links using A element matter: images (linked using
> IMG), stylesheets (linked using STYLE) and JavaScript files (linked
> using SCRIPT) were never affected.
>
> Signed-off-by: Jakub Narebski <jnareb@gmail.com>
> ---
>  gitweb/gitweb.perl                       |    5 ++++-
>  gitweb/static/js/javascript-detection.js |    5 +++++
>  2 files changed, 9 insertions(+), 1 deletions(-)
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 7456a4b..f1c1caa 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -3626,13 +3626,16 @@ EOF
>        print "<div class=\"page_header\">\n";
>        if (defined $logo) {
>                print $cgi->a({-href => esc_url($logo_url),
> +                              -class => "external",
>                               -title => $logo_label},
>                              $cgi->img({-src => esc_url($logo),
>                                         -width => 72, -height => 27,
>                                         -alt => "git",
>                                         -class => "logo"}));
>        }
> -       print $cgi->a({-href => esc_url($home_link)}, $home_link_str) . " / ";
> +       print $cgi->a({-href => esc_url($home_link)
> +                      -class => "external"},
> +                     $home_link_str) . " / ";
>        if (defined $project) {
>                print $cgi->a({-href => href(action=>"summary")}, esc_html($project));
>                if (defined $action) {
> diff --git a/gitweb/static/js/javascript-detection.js b/gitweb/static/js/javascript-detection.js
> index 2b51e55..fc59e42 100644
> --- a/gitweb/static/js/javascript-detection.js
> +++ b/gitweb/static/js/javascript-detection.js
> @@ -60,6 +60,11 @@ function fixLink(link) {
>         */
>        var jsExceptionsRe = /[;?]js=[01]$/;
>
> +       // don't change links marked as external ($logo_url, $home_link)
> +       if (link.className === 'external') {
> +               return;
> +       }
> +
>        if (!jsExceptionsRe.test(link)) { // =~ /[;?]js=[01]$/;
>                link.href +=
>                        (link.href.indexOf('?') === -1 ? '?' : ';') + 'js=1';
>
>
>



-- 
http://blog.blackdown.de/
http://www.flickr.com/photos/jkreileder/

^ permalink raw reply

* Re: [PATCH 11/13] strbuf: add strbuf_add*_urlencode
From: Jeff King @ 2011-11-29 21:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vzkfessff.fsf@alter.siamese.dyndns.org>

On Tue, Nov 29, 2011 at 10:19:00AM -0800, Junio C Hamano wrote:

> > +static int is_rfc3986_reserved(char ch)
> > +{
> > +	switch (ch) {
> > +	case '!': case '*': case '\'': case '(': case ')': case ';':
> > +	case ':': case '@': case '&': case '=': case '+': case '$':
> > +	case ',': case '/': case '?': case '#': case '[': case ']':
> > +		return 1;
> > +	}
> > +	return 0;
> > +}
> 
> Part of me wonders if we still have extra bits in sane_ctype[] array but
> that one is cumbersome to update, and the above should be easier to read
> and maintain.

We have 2 bits left. I did consider it, but it just seemed excessively
cumbersome for something that really doesn't need to be that fast (if it
is indeed any faster than this case statement).

> > +void strbuf_add_urlencode(struct strbuf *sb, const char *s, size_t len,
> > +			  int reserved)
> 
> Does "reserved" parameter mean "must-encode-reserved", or
> "may-encode-reserved" (the latter would be more like "if set to 0,
> per-cent encoding the result would be an error")?

It is "must-encode-reserved". The difference, from my reading of the
rfc, is that we can relax our encoding in the path-name portion of the
URI. For example, in:

  https://user@host/path/to/repo.git

You definitely want to quote "/" in the user or hostname, but doing so
in path/to/repo.git is just annoying.

-Peff

^ permalink raw reply

* Re: [PATCH 03/13] introduce credentials API
From: Jeff King @ 2011-11-29 21:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vmxbeu91d.fsf@alter.siamese.dyndns.org>

On Tue, Nov 29, 2011 at 09:34:54AM -0800, Junio C Hamano wrote:

> >> > +`credential_fill`::
> >> > +
> >> > +	Attempt to fill the username and password fields of the passed
> >> > +	credential struct, first consulting storage helpers, then asking
> >> > +	the user. Guarantees that the username and password fields will
> >> > +	be filled afterwards (or die() will be called).

> Immensely, at least to me. From the perspective of a user of the API, a
> call to credential_fill() is to "fill in the credential" in the sense that
> "call the function to fill in the credential" but I find it easier to
> understand if it were explained to me as "ask the API to fill in the
> credential, which may involve helpers to interact with the user--the point
> of the API is that the caller does not care how it is done".  Same for the
> reject/accept calls---the example makes it clear that they are to tell the
> decision to reject/accept made by the application to the credential API,
> and it is up to the API layer what it does using that decision (like
> removing the cached and now stale password).

Ahh, I see your confusion now. It is not that the description is
necessarily lacking any content, but that the tense of the sentences is
misleading. I'll fix that.

> The above example is a bit too simplistic and misleading, though. You
> would call reject only on authentication failure (do not trash stored and
> good password upon network being unreachable temporarily or the server
> being overloaded).

Good point. I'll fix that and add the example to the documentation.

> > So one possible rule would be:
> >
> >   1. If it starts with "!", clip off the "!" and hand it to the shell.
> >
> >   2. Otherwise, if is_absolute_path(), hand it to the shell directly.
> >
> >   3. Otherwise, prepend "git credential-" and hand it to the shell.
> >
> > I think that is slightly less confusing than the "first word is alnum"
> > thing.
> 
> Simpler and easier to explain. Good ;-)

OK, I'll implement that, then.

> > How do you feel about the "values cannot contain a newline" requirement?
> 
> In the context of asking username, password, or passphrase, I think "LF is
> the end of the line and you cannot have that byte in your response" is
> perfectly reasonable. I've yet to find a way to use LF in a passphrase to
> unlock my Gnome keychain ;-).

The potential issue is that other values get that, too. So if you have a
URL with "\n" in the path, it cannot be transmitted verbatim. We can
url-encode, of course, but I didn't want the helpers to have to deal
with quoting issues.

> >> Two style nits.
> >
> > I'm supposed to guess? ;P
> 
> Sorry, but you guessed right.

OK, will fix.

-Peff

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Jeff King @ 2011-11-29 20:59 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Bill Zaumen, git, gitster, spearce, torvalds
In-Reply-To: <CACsJy8DboVU4kSbJSV=8NP08OyLYVgOKsm8tt=koZ0=JcGSE=A@mail.gmail.com>

On Tue, Nov 29, 2011 at 09:04:06PM +0700, Nguyen Thai Ngoc Duy wrote:

> > You can fix this by including an extra header in the signed part of the
> > tag that says "also, the digest of the commit I point to is X". Then you
> > know you have the same commit that Linus had. But the commit points to a
> > tree by its sha1. So you have to add a similar header in the commit
> > object that says "also, the digest of the tree I point to is X". And
> > ditto for all of the parent pointers, if you want to care about signing
> > history. And then you have the same problem in the tree: each sub-tree
> > and blob is referenced by its sha1.
> >
> Can we just hash all objects in a pack from bottom up, (replacing
> sha-1 in trees/commits/tags with the new digest in memory before
> hashing), then attach the new top digest to tag's content? The sender
> is required by the receiver to send new digests for all objects in the
> pack together with the pack. The receiver can then go through the same
> process to produce the top digest and match it with one saved in tag.

I think that is conflating two different layers of git. The security for
tags happens at the conceptual object db layer: you sign a tag, and that
points to a commit, which points to a tree, and so on. The authenticity
comes from the tag signature, but the integrity of each link in the
chain is verifiable because of the has property. The pack layer, on the
other hand, is just an implementation detail about how those conceptual
objects are stored. More than just your tag will be in a pack, and the
contents of your tag may be spread across several packs (or even loose
objects).

So I don't think it's right to talk about packs at all in the signature
model.

If you wanted to say "make a digest of all of the sub-objects pointed to
by the tag", then yes, that does work (security-wise). But it's
expensive to calculate. Instead, you want to use a "digest of digests"
as much as possible. Which is what git already does, of course; you hash
the tree object, which contains hashes of the blob sha1s. Git's
conceptual model is fine. The only problem is that sha1 is potentially
going to lose its security properties, weakening the links in the chain.
So as much as possible, we want to insert additional links at the exact
same places, but using a stronger algorithm.

Does that make sense?

-Peff

^ permalink raw reply

* Re: support gnupg-2.x in git.
From: Junio C Hamano @ 2011-11-29 20:29 UTC (permalink / raw)
  To: Paweł Sikora; +Cc: git
In-Reply-To: <201111291937.34324.pawel.sikora@agmk.net>

Paweł Sikora <pawel.sikora@agmk.net> writes:

> i'm using a gnupg-2.0.18 and currently i'm not able to use git tag/verify
> due to hadcoded "gpg" literals in builtin/{tag,verifiy-tag}.c.

Stating the obvious...

  $ ln -s /usr/local/not/on/my/path/bin/gnupg-2.0.18 $HOME/bin/gpg
  $ PATH=$HOME/bin:$PATH

Or this untested patch, which applies on top of jc/signed-commit, as the
GnuPG interface is in the process of getting heavily refactored.

-- >8 --
Subject: gpg-interface: allow use of a custom GPG binary

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config.txt  |   11 +++++++++++
 Documentation/git-tag.txt |    8 +++++---
 gpg-interface.c           |   11 ++++++++---
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index b30c7e6..094c1c9 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1094,6 +1094,17 @@ grep.lineNumber::
 grep.extendedRegexp::
 	If set to true, enable '--extended-regexp' option by default.
 
+gpg.program::
+	Use this custom program instead of "gpg" found on $PATH when
+	making or verifying a PGP signature. The program must support the
+	same command line interface as GPG, namely, to verify a detached
+	signature, "gpg --verify $file - <$signature" is run, and the
+	program is expected to signal a good signature by exiting with
+	code 0, and to generate an ascii-armored detached signature, the
+	standard input of "gpg -bsau $key" is fed with the contents to be
+	signed, and the program is expected to send the result to its
+	standard output.
+
 gui.commitmsgwidth::
 	Defines how wide the commit message window is in the
 	linkgit:git-gui[1]. "75" is the default.
diff --git a/Documentation/git-tag.txt b/Documentation/git-tag.txt
index c83cb13..74fc7e0 100644
--- a/Documentation/git-tag.txt
+++ b/Documentation/git-tag.txt
@@ -38,7 +38,9 @@ created (i.e. a lightweight tag).
 A GnuPG signed tag object will be created when `-s` or `-u
 <key-id>` is used.  When `-u <key-id>` is not used, the
 committer identity for the current user is used to find the
-GnuPG key for signing.
+GnuPG key for signing. 	The configuration variable `gpg.program`
+is used to specify custom GnuPG binary.
+
 
 OPTIONS
 -------
@@ -48,11 +50,11 @@ OPTIONS
 
 -s::
 --sign::
-	Make a GPG-signed tag, using the default e-mail address's key
+	Make a GPG-signed tag, using the default e-mail address's key.
 
 -u <key-id>::
 --local-user=<key-id>::
-	Make a GPG-signed tag, using the given key
+	Make a GPG-signed tag, using the given key.
 
 -f::
 --force::
diff --git a/gpg-interface.c b/gpg-interface.c
index ff232c8..18630ff 100644
--- a/gpg-interface.c
+++ b/gpg-interface.c
@@ -5,6 +5,7 @@
 #include "sigchain.h"
 
 static char *configured_signing_key;
+static const char *gpg_program = "gpg";
 
 void set_signing_key(const char *key)
 {
@@ -15,9 +16,12 @@ void set_signing_key(const char *key)
 int git_gpg_config(const char *var, const char *value, void *cb)
 {
 	if (!strcmp(var, "user.signingkey")) {
+		set_signing_key(value);
+	}
+	if (!strcmp(var, "gpg.program")) {
 		if (!value)
 			return config_error_nonbool(var);
-		set_signing_key(value);
+		gpg_program = xstrdup(value);
 	}
 	return 0;
 }
@@ -46,7 +50,7 @@ int sign_buffer(struct strbuf *buffer, struct strbuf *signature, const char *sig
 	gpg.argv = args;
 	gpg.in = -1;
 	gpg.out = -1;
-	args[0] = "gpg";
+	args[0] = gpg_program;
 	args[1] = "-bsau";
 	args[2] = signing_key;
 	args[3] = NULL;
@@ -101,10 +105,11 @@ int verify_signed_buffer(const char *payload, size_t payload_size,
 			 struct strbuf *gpg_output)
 {
 	struct child_process gpg;
-	const char *args_gpg[] = {"gpg", "--verify", "FILE", "-", NULL};
+	const char *args_gpg[] = {NULL, "--verify", "FILE", "-", NULL};
 	char path[PATH_MAX];
 	int fd, ret;
 
+	args_gpg[0] = gpg_program;
 	fd = git_mkstemp(path, PATH_MAX, ".git_vtag_tmpXXXXXX");
 	if (fd < 0)
 		return error("could not create temporary file '%s': %s",

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox