From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Brian O'Mahoney" Subject: The git repo format Date: Wed, 27 Apr 2005 21:47:04 +0200 Message-ID: <426FEC38.1060507@khandalf.com> References: <426F2671.1080105@zytor.com> <426FD3EE.5000404@zytor.com> Reply-To: omb@bluewin.ch Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Git Mailing List X-From: git-owner@vger.kernel.org Wed Apr 27 21:43:03 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DQsPW-0006Tt-27 for gcvg-git@gmane.org; Wed, 27 Apr 2005 21:41:26 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261981AbVD0Tqt (ORCPT ); Wed, 27 Apr 2005 15:46:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261983AbVD0Tqt (ORCPT ); Wed, 27 Apr 2005 15:46:49 -0400 Received: from mxout.hispeed.ch ([62.2.95.247]:9198 "EHLO smtp.hispeed.ch") by vger.kernel.org with ESMTP id S261981AbVD0Tqj (ORCPT ); Wed, 27 Apr 2005 15:46:39 -0400 Received: from khandalf.com (80-218-57-125.dclient.hispeed.ch [80.218.57.125]) (authenticated bits=0) by smtp.hispeed.ch (8.12.6/8.12.6/tornado-1.0) with ESMTP id j3RJkbdE003772 for ; Wed, 27 Apr 2005 21:46:37 +0200 Received: from [127.0.0.1] (localhost [127.0.0.1]) by teraflex.teraflex-research.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j3RJl7Mr021462 for ; Wed, 27 Apr 2005 21:47:09 +0200 User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050317) X-Accept-Language: en-us, en In-Reply-To: X-Enigmail-Version: 0.90.2.0 X-Enigmail-Supports: pgp-inline, pgp-mime X-Md5-Body: c244a13e51b8700b3575ccede9897781 X-Transmit-Date: Wednesday, 27 Apr 2005 21:47:14 +0200 X-Message-Uid: 0000b49cec9d61680000000200000000426fec420005e032000000010008d704 Replyto: omb@bluewin.ch X-Sender-Postmaster: Postmaster@80-218-57-125.dclient.hispeed.ch. X-Virus-Scanned: ClamAV version 0.83, clamav-milter version 0.83 on smtp-02.tornado.cablecom.ch X-Virus-Status: Clean To: unlisted-recipients:; (no To-header on input) Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org In understanding how to work with 'git' I had a number of initial difficulties which are mostly covered by the e-mail from Linus below. Most of these are already covered in the README: for objects, ie blob, commit, tag, tree: inflate, then \s\0 where is in the form, described by Linus below when you look at them closely, all the formats are simple, un-ambiguous, and very easy to parse. The index is also easy to parse, but there is a detail, after the 3-int header the records are padded to a multiple of 8 bytes. The detail is in cache.h. Maybe the README needs to re-inforce this. Brian > I repeat: git does not do any free-form parsin AT ALL. ======================================================== The links are in well-defined places, and you do not ever search for them. And that's really very very important. > For a "commit", the format is > > - first line is exactly 46 bytes: five bytes of "tree ", 40 bytes of hex > sha1, and one byte of "\n". > > NOTHING ELSE. Not extra spaces at the end, not extra spaces at the > beginning or the middle. It's ASCII, but it's not free-format ASCII. > > - the next (where 'n' can be 0 or more) lines are _exactly_ 48 bytes > each: seven bytes of "parent ", 40 bytes of hex sha1, and one byte of > "\n". > > NOTHING ELSE. > > - the next lines are "author " and "committer ". They have well-defined > delimters for their fields, and no sha1's. The fields cannot contain > '<', '>' or newlines, since those are the field/line delimeters. > > There is no free-format text _anywhere_ that git parses. No room for > guesses, no room for mistakes, no room for anything half-way questionable. > > And fsck actually enforces this. We do _not_ just use "gets()" to read one > line at a time. We literally verify that the lines are 46/48 bytes long, > and have the delimeters in the expected places. > > Same goes for "tree" and "tag" objects. They all have fixed-format stuff. > A "tree" entry is always > > "%o %s" \0 [ 20 bytes of sha1 ] > > with "%o" being "mode", and "%s" being "path". We don't guess. > > And this really is _important_. Exactly because we name things by the SHA1 > hash of the contents, we MUST NOT have flexible formats. Having a format > which allows non-canonical representations (extra spaces etc) would mean > that two trees that were identical would depend on how you happened to > format them. > > So there's really two issues: > - we don't guess or parse contents. We have strict rules, and that makes > git more reliable. There are no gray areas. There's "right" and there > is "wrong", and the right one works, and the wrong one gets flagged as > being wrong and the tools refuse to touch it. > - there is only _one_ right way to do things, and that means that the > the content is well-defined, and thus the SHA1 of the content is > well-defined. > > For example, another rule is that a "tree" object is always sorted by > the bytes in the filename (not by entry, btw: a directory called "foo" > will sort as "foo/", even though the _entry_ only shows "foo"). That rule > not only makes a lot of operations faster, but again, it means that there > is only _one_ way to represent a tree validly. > > IOW, you _cannot_ represent a tree any other way (and I've been too lazy > to check this in fsck, but it's alway sbeen my plan), and that is exactly > why we can just compare the hashes of the results - because there is no > random component of "layout" in the contents. > > This really is important. It means that if you get to the same two tree > contents in totally unrelated ways (you unpack a tar-file and encode it in > git, or you have 5 years of git history and check it out), the "tree" will > match _exactly_. There's no history. There's no "optional" stuff. Since > the contents of the trees are the same, the SHA1 of the two trees will be > the same. Exactly because git refuses to touch any free-format stuff.