From: Junio C Hamano <junkio@cox.net>
To: git@vger.kernel.org
Cc: Kees-Jan Dijkzeul <k.j.dijkzeul@gmail.com>,
Linus Torvalds <torvalds@osdl.org>
Subject: Re: Cygwin can't handle huge packfiles?
Date: Fri, 07 Apr 2006 01:15:47 -0700
Message-ID: <7vhd55ls24.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: Pine.LNX.4.64.0604030734440.3781@g5.osdl.org
Linus Torvalds <torvalds@osdl.org> writes:
> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>>
>> That said, I think git _does_ have problems with large pack-files. We have
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
> the packfile data structure does. The index has 32-bit offsets into
> individual pack-files.
>
> That's not hugely fundamental,...
Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.
* pack-*.pack file has the following format:

  - The header appears at the beginning and consists of the following:

      4-byte signature
      4-byte version number (network byte order)
      4-byte number of objects contained in the pack (network byte order)

    Observation: we cannot have more than 4G versions ;-) and
    more than 4G objects in a pack.

  - The header is followed by that number of object entries, each of
    which looks like this:

      (undeltified representation)
      n-byte type and length (3-bit type, (n-1)*7+4-bit length)
      compressed data

      (deltified representation)
      n-byte type and length (3-bit type, (n-1)*7+4-bit length)
      20-byte base object name
      compressed delta data

    Observation: length of each object is encoded in a variable
    length format and is not constrained to 32-bit or anything.

  - The trailer records 20-byte SHA1 checksum of all of the above.
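
As an illustration of the 12-byte pack header described above, here is a
minimal Python sketch.  The function name parse_pack_header is made up
for this example, and the check that the signature reads "PACK" is an
assumption of the sketch; the description above only says that the
signature is 4 bytes long.

```python
import struct

# signature, version, object count; the two integers are in network byte order
PACK_HEADER = struct.Struct(">4sII")

def parse_pack_header(data: bytes) -> tuple[int, int]:
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    if len(data) < PACK_HEADER.size:
        raise ValueError("packfile shorter than its 12-byte header")
    signature, version, count = PACK_HEADER.unpack_from(data, 0)
    # Assumption of this sketch: the 4-byte signature is the ASCII string "PACK".
    if signature != b"PACK":
        raise ValueError(f"bad pack signature: {signature!r}")
    return version, count

# Both fields are unsigned 32-bit integers, hence the 4G ceilings noted above.
example = b"PACK" + struct.pack(">II", 2, 3)
version, count = parse_pack_header(example)
```
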
* pack-*.idx file has the following format:

  - The header consists of 256 4-byte network byte order
    integers.  N-th entry of this table records the number of
    objects in the corresponding pack, the first byte of whose
    object name is less than or equal to N.

    Observation: we would need to extend this to an array of
    8-byte integers to go beyond 4G objects per pack, but it is
    not strictly necessary.

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

      4-byte network byte order integer, recording where the
      object is stored in the packfile as the offset from the
      beginning.

      20-byte object name.

    Observation: we would definitely need to extend this to
    8-byte integer plus 20-byte object name to handle a packfile
    that is larger than 4GB.

  - The file is concluded with a trailer:

      A copy of the 20-byte SHA1 checksum at the end of
      corresponding packfile.

      20-byte SHA1-checksum of all of the above.
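
To make the idx layout concrete, the following Python sketch looks up
an object name through the fan-out table and then binary-searches the
sorted entries.  It is only an illustration, not git's own code.  It
assumes that fanout[N] counts the objects whose first name byte is at
most N, which is what the "fanout[0] = 2" picture later in this message
and the "last entry is the number of objects" check both imply.  The
names find_offset and build_idx are invented here, and build_idx writes
a zeroed trailer instead of real checksums.

```python
import struct

FANOUT_SIZE = 256 * 4   # 256 four-byte counters
ENTRY_SIZE = 4 + 20     # 4-byte offset + 20-byte object name

def find_offset(idx: bytes, name: bytes):
    """Return the packfile offset recorded for a 20-byte object name, or None."""
    fanout = struct.unpack_from(">256I", idx, 0)
    first = name[0]
    # Candidates whose name starts with `first` occupy the entry range
    # [fanout[first - 1], fanout[first]); for first == 0 the range starts at 0.
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    while lo < hi:                      # binary search over the sorted entries
        mid = (lo + hi) // 2
        pos = FANOUT_SIZE + mid * ENTRY_SIZE
        entry_name = idx[pos + 4 : pos + ENTRY_SIZE]
        if entry_name == name:
            return struct.unpack_from(">I", idx, pos)[0]
        if entry_name < name:
            lo = mid + 1
        else:
            hi = mid
    return None

def build_idx(entries):
    """Toy idx builder for the demo; entries is a list of (name, offset)."""
    entries = sorted(entries)
    fanout = [0] * 256
    for name, _ in entries:
        fanout[name[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]      # turn per-byte counts into running totals
    body = b"".join(struct.pack(">I", off) + name for name, off in entries)
    return struct.pack(">256I", *fanout) + body + bytes(40)  # zeroed trailer

# Two names starting with 0x00 and one starting with 0x01, as in the diagram.
a = bytes([0x00]) * 20
b = bytes([0x00]) + bytes([0x01]) * 19
c = bytes([0x01]) * 20
idx = build_idx([(c, 250), (a, 12), (b, 100)])
```

With this toy index, find_offset(idx, c) only has to examine the single
entry between fanout[0] and fanout[1].
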

This is not fundamental, in that pack idx file is something we
can regenerate from a packfile.  The push/fetch transfer over
git native protocols does not even transfer pack idx file;
instead, the recipient uses git-index-pack to generate pack idx.

git-index-pack would need to be updated to write the necessary
fields as 8-byte integers, without breaking existing packfiles.

The code to read idx file currently has a sanity check logic to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header gives the number
of objects recorded in the pack, so the expected file size is
known).  So we could reliably distinguish between the current
24-byte version and 28-byte "beyond 4GB" version, and support
both formats at the same time.
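
The size check can be sketched in Python as follows.  This is only an
illustration of the idea: the function name idx_entry_size is invented
here, the 28-byte layout is the proposal from this message and not an
existing format, and the sketch assumes the fan-out table stays at
256 4-byte integers in both layouts.

```python
import struct

FANOUT_SIZE = 256 * 4
TRAILER_SIZE = 20 + 20          # packfile checksum + idx checksum

def idx_entry_size(idx: bytes) -> int:
    """Return 24 or 28, whichever entry size the idx file length matches."""
    if len(idx) < FANOUT_SIZE + TRAILER_SIZE:
        raise ValueError("idx file too short")
    # The last fan-out entry is the total number of objects in the pack.
    (count,) = struct.unpack_from(">I", idx, FANOUT_SIZE - 4)
    body = len(idx) - FANOUT_SIZE - TRAILER_SIZE
    for entry_size in (24, 28):  # 28 = proposed 8-byte offset + 20-byte name
        if body == count * entry_size:
            # An empty index matches both layouts; it is reported as 24-byte.
            return entry_size
    raise ValueError("idx size matches neither 24-byte nor 28-byte entries")
```

For any non-zero object count the two candidate sizes differ, so the
check is unambiguous.
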
Even after we start supporting the 28-byte "beyond 4GB" format,
we can and we should continue writing the current 24-byte
version of pack idx file when the packfile offset can be
expressed with 32-bit.
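
On the writing side, the choice between the two entry sizes could look
like the Python sketch below.  The helper name pack_idx_entries is
hypothetical, and the 8-byte offset layout is again the proposed
format, not something that exists today.

```python
import struct

def pack_idx_entries(entries):
    """Serialize (name, offset) pairs sorted by name; return (entry_size, data).

    4-byte offsets are used whenever every offset fits in 32 bits, and
    the proposed 8-byte offsets are used otherwise.
    """
    entries = sorted(entries)
    wide = any(offset > 0xFFFFFFFF for _, offset in entries)
    fmt = ">Q" if wide else ">I"
    data = b"".join(struct.pack(fmt, offset) + name for name, offset in entries)
    return (28 if wide else 24), data
```
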

Having said that, I have to warn that this is not for the faint
of heart.  The necessary changes would be somewhat involved.
----------------------------------------------------------------

Pack idx file

        idx
            +--------------------------------+
            | fanout[0] = 2                  |-.
            +--------------------------------+ |
            | fanout[1]                      | |
            +--------------------------------+ |
            | fanout[2]                      | |
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
            | fanout[255]                    | |
            +--------------------------------+ |
main        | offset                         | |
index       | object name 00XXXXXXXXXXXXXXXX | |
table       +--------------------------------+ |
            | offset                         | |
            | object name 00XXXXXXXXXXXXXXXX | |
            +--------------------------------+ |
          .-| offset                         |<+
          | | object name 01XXXXXXXXXXXXXXXX |
          | +--------------------------------+
          | | offset                         |
          | | object name 01XXXXXXXXXXXXXXXX |
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          | | offset                         |
          | | object name FFXXXXXXXXXXXXXXXX |
          | +--------------------------------+
trailer   | | packfile checksum              |
          | +--------------------------------+
          | | idxfile checksum               |
          | +--------------------------------+
          .-------.
                  |
Pack file entry: <+

     packed object header:
        1-byte continuation flag (bit 7, set when more size bytes follow)
               type (bit 4-6)
               size0 (bit 0-3)
        n-byte sizeN (as long as MSB is set, each 7-bit)
                size0..sizeN form 4+7+7+..+7 bit integer, size0
                is the least significant part, and sizeN is the
                most significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
                is the size before compression).
        If it is DELTA, then
          20-byte base object name SHA1 (the size above is the
                size of the delta data that follows).
          delta data, deflated.
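
The variable-length object header can be decoded with a few lines of
Python.  This is a sketch and not git's own code; the function name
parse_object_header is invented for the example.  It treats size0 as
the least significant four bits and each following 7-bit group as more
significant than the one before it, which is how git's decoder
accumulates the value.

```python
def parse_object_header(buf: bytes, pos: int = 0):
    """Decode a packed object header; return (type, size, next_pos)."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07    # bits 4-6
    size = byte & 0x0F               # bits 0-3: the low four bits of the size
    shift = 4
    while byte & 0x80:               # bit 7 set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos

# 0x95 = 1 001 0101: more bytes follow, type 1, size0 = 5
# 0x0A = 0 0001010:  last byte, contributes 10 << 4 = 160
obj_type, size, next_pos = parse_object_header(bytes([0x95, 0x0A]))
```

Because the loop keeps shifting by 7 for as long as the top bit is set,
the decoded size is not limited to 32 bits.
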