From mboxrd@z Thu Jan 1 00:00:00 1970
From: A Large Angry SCM
Subject: Git files data formats documentation
Date: Fri, 04 Aug 2006 22:39:25 -0700
Message-ID: <44D42F0D.3040707@gmail.com>
Reply-To: gitzilla@gmail.com
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------080705040702010403050202"
X-From: git-owner@vger.kernel.org Sat Aug 05 07:39:39 2006
Return-path:
Envelope-to: gcvg-git@gmane.org
Received: from vger.kernel.org ([209.132.176.167]) by ciao.gmane.org with esmtp (Exim 4.43) id 1G9Eso-0007Sr-Qz for gcvg-git@gmane.org; Sat, 05 Aug 2006 07:39:35 +0200
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932624AbWHEFjb (ORCPT ); Sat, 5 Aug 2006 01:39:31 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932626AbWHEFjb (ORCPT ); Sat, 5 Aug 2006 01:39:31 -0400
Received: from py-out-1112.google.com ([64.233.166.178]:30411 "EHLO py-out-1112.google.com") by vger.kernel.org with ESMTP id S932624AbWHEFj3 (ORCPT ); Sat, 5 Aug 2006 01:39:29 -0400
Received: by py-out-1112.google.com with SMTP id z74so896239pyg for ; Fri, 04 Aug 2006 22:39:28 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:disposition-notification-to:date:from:reply-to:user-agent:x-accept-language:mime-version:to:subject:content-type; b=Lel7EQdFjLC+MQgZMNbv2B1h44JDmJzEaC35VfAUovNtc1vnviUJto9V4OLWRVP+lw5du/4JBh8urdxKqTmeUW177rOo/A+yavrAHScjbPn6vUrFhtJey1lw3WEmg6Kakz0UD2YhLmsJwSSX5Ss5XfrBEuCbbWfwJh1+Cki22P8=
Received: by 10.35.113.12 with SMTP id q12mr6550215pym; Fri, 04 Aug 2006 22:39:28 -0700 (PDT)
Received: from ?10.0.0.6? ( [70.32.187.210]) by mx.gmail.com with ESMTP id 14sm1468643nzp.2006.08.04.22.39.26; Fri, 04 Aug 2006 22:39:27 -0700 (PDT)
User-Agent: Mozilla Thunderbird 1.0.8 (X11/20060411)
X-Accept-Language: en-us, en
To: git@vger.kernel.org
Sender: git-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: git@vger.kernel.org
Archived-At:

This is a multi-part message in MIME format.
--------------080705040702010403050202
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

This information may be useful for reading and writing the various Git
files.

--------------080705040702010403050202
Content-Type: text/plain; name="dataformats.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="dataformats.txt"

Git files data formats
======================

OBJECTS
-------

# The object ID, or "name", of an object is
# _sha-1_digest_( ).
: _deflate_( )
| _deflate_( )
;

: ;

: _deflate_( )
| _deflate_( )
;

: ;

# Tree entries are sorted by the byte sequence that comprises
# the entry name.
: ( )*
;

# The type of the object referenced MUST be appropriate for
# the mode. Regular files and symbolic links reference a BLOB
# and directories reference a TREE.
: ;

: _deflate_( )
| _deflate_( )
;

: "tree" ( "parent" )* "author" "committer"
;

: _deflate_( )
| _deflate_( )
;

: "object" "type" "tag"
;

: ;

# The object type DELTA_ENCODED is not valid in a
# .
: ;

# Uninterpreted sequence of bytes.
;

# Octal encoding, without prefix, of the file system object
# type and permission bits. The bit layout is according to the
# POSIX standard, with only regular files, directories, and
# symbolic links permitted. The actual permission bits are
# all zero except for regular files. The only permission bit
# of any consequence to Git is the owner executable bit. By
# default, the permission bits for files will be either 0644
# or 0755, depending on the owner executable bit.
;

# Sequence of bytes not containing the ASCII character byte
# value NUL (0x00).
;

# The object ID of the referenced object.
;

# Hexadecimal encoding (lower case) of the .
;

: ;

: ;

# A sequence of bytes not containing the ASCII character byte
# values NUL (0x00), LF (0x0a), '<' (0x3c), or '>' (0x3e).
#
# The sequence may not begin or end with any bytes with the
# following ASCII character byte values: SPACE (0x20),
# '.'
# (0x2e), ',' (0x2c), ':' (0x3a), ';' (0x3b), '<' (0x3c),
# '>' (0x3e), '"' (0x22), "'" (0x27).
;

: ;

# Base 10, ASCII encoding of the number of seconds since 12:00
# midnight January 1, 1970, UTC without accounting for leap
# seconds, and without leading zeros.
;

# Signed offset of time zone from UTC.
: ;

: "+" | "-" ;

: ;

# Valid values are "00" to "59" inclusive.
: ;

# ASCII decimal digit.
;

: "BLOB" | "TREE" | "COMMIT" ;

: | "TAG" ;

# Base 10, ASCII encoding of the byte length of the object
# contents, without leading zeros. The length value does not
# include the length of the .
: ( )+
;

# ASCII SPACE (0x20) character.
;

# ASCII NUL (0x00) character.
;

# ASCII LF (0x0a) "line feed" character.
;


PACK FILE
---------

# The name of a pack file is "pack-${PACK_ID}.pack", where ${PACK_ID}
# is the hexadecimal encoding (lower case) of the SHA-1 digest of the
# sorted list of binary object IDs in the pack file without a separator
# between the object IDs. Initially, the ${PACK_ID} for a pack was not
# defined, making the value effectively random.

: ;

: "PACK" ( )*
;

# 32 bit, network byte order, binary integer indicating which
# version of the pack file format was used to create the pack
# file.
;

# 32 bit, network byte order, binary integer containing the
# number of objects encoded in the pack file.
;

: _sha-1_digest_( ) ;

# If the object type is not a DELTA_ENCODED object, the packed
# object data that follows is the deflated byte sequence of the
# object without the Git object header. The length value is the
# byte count of the inflated byte sequence of the object.
#
# If the object type is a DELTA_ENCODED object, what follows is
# the ID of the base object and the deflated delta data to
# transform the base object into the target object. The type of
# the target object is the same as that of the base object and
# the length value is the byte count of the inflated delta
# data.
# The base object may also be DELTA_ENCODED but cyclic
# base object chains are not permitted and the pack file MUST
# contain all base objects.
: ;

# A compact, variable length, encoding of the packed object
# length and type. The first byte is comprised of 3 fields
# (where bit 0 is the least significant bit in a byte):
#     bit 7:      more flag
#     bits 6-4:   object type
#     bits 3-0:   least significant bits of the object
#                 length.
# If the more flag is set, the next byte contains more object
# length bits.
# The object types corresponding to the object type bits are:
#     6 5 4
#     - - -
#     0 0 0   invalid: Reserved
#     0 0 1   COMMIT object
#     0 1 0   TREE object
#     0 1 1   BLOB object
#     1 0 0   TAG object
#     1 0 1   invalid: Reserved
#     1 1 0   invalid: Reserved
#     1 1 1   DELTA_ENCODED object
#
# If the more flag was set, the next byte will have more length
# bits and will be comprised of 2 fields:
#     bit 7:      more flag
#     bits 6-0:   7 additional, more significant, bits of
#                 the object length
# If the more flag is set, the next byte contains more object
# length bits using the same encoding.
;

: _deflate_( )
| _deflate_( )
;

# Size of the base object encoded as a base 128 number, least
# significant bits first, using bit 7 (the most significant
# bit) of each byte to indicate that more bits follow.
#
# Size of the result object encoded as a base 128 number, using
# the same method as used for the base object size.
#
# There will then be a sequence of delta hunks.
# Zero as the value of the first byte of a hunk is reserved.
#
# If bit 7 of the first byte of a delta hunk is not set, the
# hunk is an "insert" hunk and bits 0-6 specify the number of
# bytes to append to the output buffer from the hunk.
#
# If bit 7 of the first byte of a delta hunk is set, the hunk
# is a "copy" hunk and bits 0-6 specify how the remaining
# bytes in the hunk make up the base offset and length for the
# copy.
# The following C code demonstrates how to determine the
# base offset and length for the copy:
#
#     /* - - - - - - - - - - - - *\
#      | This reflects version 3 pack files |
#     \* - - - - - - - - - - - - */
#
#     byte *data = delta_hunk_start;
#     opcode = *data++;
#     off_t copy_offset = 0;
#     size_t copy_length = 0;
#
#     for (shift=i=0; i<4; i++) {
#         if (opcode & 0x01) {
#             copy_offset |= (*data++)<<shift;
#         }
#         opcode >>= 1;
#         shift += 8;
#     }
#
#     for (shift=i=0; i<3; i++) {
#         if (opcode & 0x01) {
#             copy_length |= (*data++)<<shift;
#         }
#         opcode >>= 1;
#         shift += 8;
#     }
#
#     if (!copy_length) {
#         copy_length = 1<<16;
#     }
#
# For version 2 pack files, the size of a copy is limited to
# 64K bytes or less and bit 6 of the opcode byte is set if the
# source of the copy is from the buffer of the result object
# instead of the base object.
#
# It's unknown if any version 2 pack files were created with
# bit 6 set in the opcode byte; however, the change that added
# support for version 3 pack files removed the code that would
# change the copy source to the result buffer.
#
#     /* - - - - - - - - - - - - *\
#      | This reflects version 2 pack files |
#     \* - - - - - - - - - - - - */
#
#     byte *data = delta_hunk_start;
#     opcode = *data++;
#     off_t copy_offset = 0;
#     size_t copy_length = 0;
#
#     for (shift=i=0; i<4; i++) {
#         if (opcode & 0x01) {
#             copy_offset |= (*data++)<<shift;
#         }
#         opcode >>= 1;
#         shift += 8;
#     }
#
#     for (shift=i=0; i<2; i++) {
#         if (opcode & 0x01) {
#             copy_length |= (*data++)<<shift;
#         }
#         opcode >>= 1;
#         shift += 8;
#     }
#
#     if (!copy_length) {
#         copy_length = 1<<16;
#     }
#
#     copy_from_result = opcode & 0x01;
#
;


PACK INDEX
----------

# The name of a pack file index is "pack-${PACK_ID}.idx", where
# ${PACK_ID} is the hexadecimal encoding (lower case) of the SHA-1
# digest of the sorted list of binary object IDs in the pack file
# without a separator between the object IDs. Initially, the ${PACK_ID}
# for a pack was not defined, making the value effectively random.
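
As an illustration of the naming rule above, here is a minimal Python
sketch that derives ${PACK_ID} from a list of object IDs and builds the
matching pack and index file names. The function names pack_id and
pack_file_names are made up for this example, and the code follows only
the description given in this document; it is not taken from any Git
implementation.

```python
import hashlib


def pack_id(object_ids_hex):
    """Hex SHA-1 of the sorted binary object IDs, concatenated with no separator.

    object_ids_hex is an iterable of 40-digit hexadecimal object IDs
    (an assumption of this sketch; the document only speaks of binary IDs).
    """
    binary_ids = [bytes.fromhex(oid) for oid in object_ids_hex]
    for oid in binary_ids:
        if len(oid) != 20:
            raise ValueError("object IDs must be 20 bytes (40 hex digits)")
    # Sort the raw 20-byte IDs, then hash them back to back.
    return hashlib.sha1(b"".join(sorted(binary_ids))).hexdigest()


def pack_file_names(object_ids_hex):
    """Return the (pack file name, index file name) pair for the given IDs."""
    pid = pack_id(object_ids_hex)
    return "pack-%s.pack" % pid, "pack-%s.idx" % pid
```

Because the IDs are sorted before hashing, the order in which the
objects happen to be listed does not change the resulting name.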

: ;

: ( ){256} ( )*
;

# 32 bit, network byte order, binary integer of the count of
# objects in the pack file with the first byte of the object
# ID less than or equal to the index of the count, starting
# from zero.
;

# 32 bit, network byte order, binary integer giving the offset,
# in bytes from the beginning of the pack file, where the
# encoding of the object starts.
;

: _sha-1_digest_( ) ;


INDEX FILE (CACHE)
------------------

: | ;

# This format is no longer supported.
: ;

: ;

: "DIRC" ;

# 32 bit, network byte order, binary integer indicating which
# version of the index file format was used to create the
# index file.
;

# 32 bit, network byte order, binary integer containing the
# number of index entries in the index file.
;

: _sha-1_digest_( ) ;

: _sha-1_digest_( ) ;

: ( )* ;

: ;

: ;

# These fields are used as a part of a heuristic to determine
# if the file system entity associated with this entry has
# changed. The names are very *nix centric but the exact
# contents of each field have no meaning to Git, besides exact
# match, except for the and fields.
: ;

# The timestamp of the last status change of the associated
# file system entity.
: ;

# The timestamp of the last modification of the associated
# file system entity.
: ;

: ;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) timestamp.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) more precise
# timestamp, if available.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) file system
# device identifier. Use of this field is a compile time
# option.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) inode number, or
# equivalent.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) file system
# entity type and permissions.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) file system
# entity owner identifier.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) file system
# entity group identifier, or equivalent.
;

# 32 bit, network byte order, binary integer containing the lower
# 32 bits of the entry (file or symbolic link) size.
;

# Object ID of the file system entity contents.
;

# 16 bit, network byte order, binary integer.
#     bits 15-14  Reserved
#     bits 13-12  Entry stage
#     bits 11-0   Name byte length
#
# See git-read-tree(1) for a description of how the stage
# field is used.
;

# File system entity name. Path is normalized and relative to
# the working directory.
;

: ( )* ;

: ;

: ;

# 4 byte sequence identifying how the
# should be interpreted. The first byte has a value greater
# than or equal to the ASCII character 'A' (0x41) and less than
# or equal to the ASCII character 'Z' (0x5a).
;

# 32 bit, network byte order, binary integer containing the
# length of the byte sequence.
;

# Sequence of bytes.
;

--------------080705040702010403050202--