* Understanding version 4 packs
From: Peter Eriksen @ 2007-03-24 20:23 UTC
To: git
Hello Shawn (and Nicolas and other interested parties),
I have been reading the commits in the
git://repo.or.cz/git/fastimport.git/ repository (git makes it quite easy
to see what differs from mainline using "git log master..pack4"), and I
think I have understood some of the details.
The easiest thing to get was the file name table, which is placed in the
beginning of the pack (after the header) using the format:
+------------+-------------------------------+
| NR_ENTRIES | Compressed file name table    |
+------------+-------------------------------+
   4 bytes
The uncompressed file name table contains NR_ENTRIES entries,
and looks like this:
+------+--------------+------+------------------------+----
| MODE | Full path 1  | MODE | Full path 2            | ...
+------+--------------+------+------------------------+----
2 bytes    n1 bytes   2 bytes        n2 bytes
The table is sorted by path then mode for easy binary lookup, and so
that pointers into this table can be compared directly instead of
comparing the corresponding paths and modes.
There is a new tree type called OBJ_DICT_TREE, which looks something
like the following:
+-----------------+------------------------------------------------+----
| Table offset    | SHA-1 of the blob corresponding to the path.   | ...
+-----------------+------------------------------------------------+----
      6 bytes                         20 bytes
These new tree objects will remain uncompressed in the pack file, but
sorted with, and deltaed against other tree objects. All normal tree
objects are converted to OBJ_DICT_TREE when packing, and are converted
back on the fly to callers who need an ordinary OBJ_TREE.
The index (.idx) files are extended to have a 4 byte pointer to the
offset of this file name table in the pack file for easy lookup.
There is something similar with a table of common strings in commit
objects (e.g. author and timezone), and a new object OBJ_DICT_COMMIT,
but I have not understood that quite yet.
Is there something I have gotten wrong with regard to my
understanding?
Regards,
Peter
* Re: Understanding version 4 packs
From: Nicolas Pitre @ 2007-03-24 23:24 UTC
To: Peter Eriksen; +Cc: git

On Sat, 24 Mar 2007, Peter Eriksen wrote:

> There is a new tree type called OBJ_DICT_TREE, which looks something
> like the following:
>
> +-----------------+------------------------------------------------+----
> | Table offset    | SHA-1 of the blob corresponding to the path.   | ...
> +-----------------+------------------------------------------------+----
>       6 bytes                         20 bytes

Actually it is a 2-byte index in the path table, and a 4-byte index in a
common SHA1 table.  So each tree entry is 6 bytes total.

> These new tree objects will remain uncompressed in the pack file, but
> sorted with, and deltaed against other tree objects.  All normal tree
> objects are converted to OBJ_DICT_TREE when packing, and are converted
> back on the fly to callers who need an ordinary OBJ_TREE.

Right.

> The index (.idx) files are extended to have a 4 byte pointer to the
> offset of this file name table in the pack file for easy lookup.

Right.  And it will lose the SHA1 entries since they are already
available in the pack.

> There is something similar with a table of common strings in commit
> objects (e.g. author and timezone), and a new object OBJ_DICT_COMMIT,
> but I have not understood that quite yet.
>
> Is there something I have gotten wrong with regard to my
> understanding?

I don't think so.  Note that the code is still a work in progress and
the resulting pack/index does not yet fully conform to the format we
envisaged.

Nicolas
* Re: Understanding version 4 packs
From: Peter Eriksen @ 2007-03-25 8:35 UTC
To: Nicolas Pitre; +Cc: git

On Sat, Mar 24, 2007 at 07:24:17PM -0400, Nicolas Pitre wrote:
> On Sat, 24 Mar 2007, Peter Eriksen wrote:
>
> > There is a new tree type called OBJ_DICT_TREE, which looks something
> > like the following:
> >
> > +-----------------+------------------------------------------------+----
> > | Table offset    | SHA-1 of the blob corresponding to the path.   | ...
> > +-----------------+------------------------------------------------+----
> >       6 bytes                         20 bytes
>
> Actually it is a 2-byte index in the path table, and a 4-byte index in a
> common SHA1 table.  So each tree entry is 6 bytes total.

What happens to the paths that do not have a corresponding entry in the
path name table because they are not among the 65535 most frequent
paths in the pack?

> > The index (.idx) files are extended to have a 4 byte pointer to the
> > offset of this file name table in the pack file for easy lookup.
>
> Right.  And it will lose the SHA1 entries since they are already
> available in the pack.

Does this mean that the current index format will change from:

    - The header is followed by sorted 24-byte entries, one entry
      per object in the pack.  Each entry is:

      4-byte network byte order integer, recording where the
      object is stored in the packfile as the offset from the
      beginning.

to just 4-byte entries, and are the SHA-1 entries in that extra table
of SHA-1's referenced by OBJ_DICT_TREE objects in the pack file?

Regards,

Peter

P.S. I have updated my description of the pack format.  Any comments
are welcome.

On disk format of version 4 packs (v0.1)
========================================

There is a file name table, EXT_OBJ_FILENAME_TABLE, which is placed
anywhere in the pack file, but before any OBJ_DICT_TREE objects that
reference the table, so that the pack can be easily streamed.  It
uses the format:

+-------------------------------+
| Compressed file name table    |
+-------------------------------+

The uncompressed file name table contains NR_ENTRIES entries,
and looks like this:

+------------+------+--------------+------+--------------------+----
| NR_ENTRIES | MODE | Full path 1  | MODE | Full path 2        | ...
+------------+------+--------------+------+--------------------+----
   4 bytes   2 bytes   n1 bytes    2 bytes      n2 bytes

MODE is a network-byte-order integer representing the mode of the path,
and the path is a variable length, null-terminated string.

The table is sorted by path then mode for easy binary lookup, and so
that pointers into this table can be compared directly instead of
comparing the corresponding paths and modes.  This table contains the
65535 most used paths in the entire pack.

There is a new tree type called OBJ_DICT_TREE, which looks like the
following:

+--------+----------------+----
| P offs | SHA-1 offs     | ...
+--------+----------------+----
 2 bytes      4 bytes

That is, each entry contains a 2-byte index into the path table, and a
corresponding 4-byte index into a SHA-1 table.

These new tree objects will remain uncompressed in the pack file, but
sorted with, and deltaed against other tree objects.  All normal tree
objects are converted to OBJ_DICT_TREE when packing, and are converted
back on the fly to callers who need an ordinary OBJ_TREE.

The index (.idx) files are extended to have a 4 byte pointer to the
offset of this file name table in the pack file for easy lookup.

There is something similar with a table, EXT_OBJ_IDENT_TABLE, of common
strings in commit objects (e.g. author and timezone), and a new object
OBJ_DICT_COMMIT, but I have not understood that quite yet.
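The v0.1 file name table layout described above can be sketched in a few
lines of Python.  This is only an illustration under assumptions: the
function names are made up, zlib is assumed as the compressor, paths are
treated as UTF-8, and the sort order is left to the caller since the
thread does not pin it down here.  It is not the code from the pack4
branch.

```python
import struct
import zlib


def build_filename_table(entries):
    """Serialize (mode, path) pairs in the v0.1 layout, then deflate.

    Assumed layout: 4-byte NR_ENTRIES, then per entry a 2-byte
    network-byte-order MODE followed by a NUL-terminated path.
    """
    raw = struct.pack(">I", len(entries))
    for mode, path in entries:
        raw += struct.pack(">H", mode) + path.encode("utf-8") + b"\0"
    return zlib.compress(raw)


def parse_filename_table(compressed):
    """Inflate the table and return a list of (mode, path) tuples."""
    raw = zlib.decompress(compressed)
    (nr_entries,) = struct.unpack_from(">I", raw, 0)
    pos = 4
    entries = []
    for _ in range(nr_entries):
        (mode,) = struct.unpack_from(">H", raw, pos)
        pos += 2
        end = raw.index(b"\0", pos)  # the path runs up to the next NUL
        entries.append((mode, raw[pos:end].decode("utf-8")))
        pos = end + 1
    return entries
```

A 2-byte MODE is enough for the usual Git modes: 0100644, 0100755,
0120000 and 040000 are all below 65536.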
* Re: Understanding version 4 packs
From: Shawn O. Pearce @ 2007-03-25 9:18 UTC
To: Peter Eriksen; +Cc: Nicolas Pitre, git

Peter Eriksen <s022018@student.dtu.dk> wrote:
> On Sat, Mar 24, 2007 at 07:24:17PM -0400, Nicolas Pitre wrote:
> > On Sat, 24 Mar 2007, Peter Eriksen wrote:
> >
> > > There is a new tree type called OBJ_DICT_TREE, which looks something
> > > like the following:
> > >
> > > +-----------------+------------------------------------------------+----
> > > | Table offset    | SHA-1 of the blob corresponding to the path.   | ...
> > > +-----------------+------------------------------------------------+----
> > >       6 bytes                         20 bytes
> >
> > Actually it is a 2-byte index in the path table, and a 4-byte index in a
> > common SHA1 table.  So each tree entry is 6 bytes total.
>
> What happens to the paths that do not have a corresponding entry in the
> path name table because they are not among the 65535 most frequent
> paths in the pack?

They don't appear in the table.  And any tree that uses them is
forced to use the "legacy" OBJ_TREE encoding.  Which is what we
have now in pack v2, and in loose objects.

> > > The index (.idx) files are extended to have a 4 byte pointer to the
> > > offset of this file name table in the pack file for easy lookup.
> >
> > Right.  And it will lose the SHA1 entries since they are already
> > available in the pack.
>
> Does this mean that the current index format will change from:
>
>     - The header is followed by sorted 24-byte entries, one entry
>       per object in the pack.  Each entry is:
>
>       4-byte network byte order integer, recording where the
>       object is stored in the packfile as the offset from the
>       beginning.
>
> to just 4-byte entries, and are the SHA-1 entries in that extra table
> of SHA-1's referenced by OBJ_DICT_TREE objects in the pack file?

or 8 byte entries (for 64 bit offsets, handling larger files).
But yes.

We will also still store the fan-out table at the start of the index,
as it's very useful at runtime when reading the packfile for random
access, but isn't required for accurate encoding/decoding of the data
in the packfile.

> On disk format of version 4 packs (v0.1)
> ========================================
>
> There is a file name table, EXT_OBJ_FILENAME_TABLE, which is placed
> anywhere in the pack file, but before any OBJ_DICT_TREE objects that
> reference the table, so that the pack can be easily streamed.  It
> uses the format:
>
> +-------------------------------+
> | Compressed file name table    |
> +-------------------------------+
>
> The uncompressed file name table contains NR_ENTRIES entries,
> and looks like this:
>
> +------------+------+--------------+------+--------------------+----
> | NR_ENTRIES | MODE | Full path 1  | MODE | Full path 2        | ...
> +------------+------+--------------+------+--------------------+----
>    4 bytes   2 bytes   n1 bytes    2 bytes      n2 bytes
>
> MODE is a network-byte-order integer representing the mode of the path,
> and the path is a variable length, null-terminated string.

Yes so far.

> The table is sorted by path then mode for easy binary lookup, and so
> that pointers into this table can be compared directly instead of
> comparing the corresponding paths and modes.  This table contains the
> 65535 most used paths in the entire pack.

See my prior email about the sorting.  But yes.

> There is a new tree type called OBJ_DICT_TREE, which looks like the
> following:
>
> +--------+----------------+----
> | P offs | SHA-1 offs     | ...
> +--------+----------------+----
>  2 bytes      4 bytes

See my prior email; there's also that pesky record count at the start.

> That is, each entry contains a 2-byte index into the path table, and a
> corresponding 4-byte index into a SHA-1 table.
>
> These new tree objects will remain uncompressed in the pack file, but
> sorted with, and deltaed against other tree objects.  All normal tree
> objects are converted to OBJ_DICT_TREE when packing, and are converted
> back on the fly to callers who need an ordinary OBJ_TREE.

Yup, but see my prior email as there's also the rule that OBJ_DICT_TREE
cannot delta against an OBJ_TREE (or vice-versa).

> The index (.idx) files are extended to have a 4 byte pointer to the
> offset of this file name table in the pack file for easy lookup.
>
> There is something similar with a table, EXT_OBJ_IDENT_TABLE, of common
> strings in commit objects (e.g. author and timezone), and a new object
> OBJ_DICT_COMMIT, but I have not understood that quite yet.

OBJ_DICT_COMMIT is rather simple:

  - stored uncompressed, like OBJ_DICT_TREE

+---------+------+-------+------------+-------------+-------------+-----
| RAW_LEN | tree | flags | parents... | commit_time | author_time | ...
+---------+------+-------+------------+-------------+-------------+-----
   vint    idref  1 byte   idref * n      4 bytes       4 bytes

Here RAW_LEN is the total length of this commit when it is in its
standard raw format, the one that is used to compute the SHA-1.
This helps the decoder when we need to recreate a normal commit.
We store this as a vint just because we can.

The tree and parent idrefs are currently full 20-byte SHA-1s, but
these are likely to change to 4 byte SHA-1 indexes like in an
OBJ_DICT_TREE.

The flags field is actually 3 fields crammed into 1 byte:

  flags & 128 == if set, the author_time == commit_time and the
                 author_time field is not present in the stream;

  flags &  64 == if set, the author == committer and the author
                 ident field is not present in the stream;

  flags &  63 == number of parent idrefs (n above).  May be 0.
                 I'm actually considering making this flags & 31,
                 leaving ourselves a spare bit for the future.
                 Why?  You can't make a commit with more than
                 16 parents right now.

Now after the n parent idrefs (again, 20-byte SHA-1 but could also
be the 4 byte SHA-1 indexes) we always have the commit_time field,
and optionally the author_time field (if flags & 128 == 0).

[sidenote: after re-reading this, I don't like the definition of
 flags & 128 being set implying there are 4 bytes *less* data in the
 stream.  Every other place within Git we use a bit set to mean
 *more* data follows, and a bit not set to mean *less* data follows.
 pack v4 is backwards here, and that's wrong.]

The commit_time field is a 4 byte big-endian seconds-since-epoch
thing.  We're actually saying the high-bit must not be set here,
leaving that room for future expansion.  We may just later redefine
it to mean an unsigned time_t, or to mean it's variable length
encoded, or...  ;-)

Why commit_time, and why before the idents?  Because if you look at
our revision walking code we care about commit_time to sort commits
in struct commit_list.  Making it early where we can get to it fast
helps the commit walker skip through commits it doesn't want to show.

The author_time field is not present if flags & 128 is true.
If flags & 128 is false, it's present, and uses the same encoding
as commit_time.  Why is this field optional?  Because it's not
uncommon for it to match commit_time!  ;-)

----+-----------+--------+---------------------
... | committer | author | deflated_message ...
----+-----------+--------+---------------------
         vint      vint

Now to finish out the object we have the committer as a variable
length integer index into EXTOBJ_IDENT_TABLE.  The author is the
same, except it's optional and is only present if flags & 64 is false.
Why?  Again, because it is common for author == committer in many
projects.

The remainder of the buffer is the zlib deflated message.

Now the message is tricky.  When inflated it actually usually starts
with an LF.  Why?  In 'raw' format of a commit we consider the end of
the header lines and the start of the message itself to be a single
blank line.  But there can be additional headers beyond
tree/parent/author/committer.  Like what?  The newer encoding header!

So commits that have no encoding header have their inflated message
starting with an LF.  Commits that actually used the encoding header
have their inflated message starting with 'encoding '.  So we can
tell if there are additional headers (or not) in a given commit by
looking at the inflated message to see if the first character is
an LF, or not.

This format allows us to store any additional headers that might get
developed, while still enjoying the benefits of the EXTOBJ_DICT_COMMIT
encoding for the headers that are currently somewhat important to Git.

--
Shawn.
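As a reading aid, the flags byte and the LF check described in the
message above can be sketched in Python.  The function and key names
are hypothetical, and the bit meanings are those of the draft as
posted, which the same message already proposes to change, so this
is not the final encoding.

```python
def decode_commit_flags(flags):
    """Split the OBJ_DICT_COMMIT flags byte into its three fields.

    Bit meanings follow the draft described in the thread:
      128 -> author_time equals commit_time and is omitted
       64 -> author equals committer and is omitted
       63 -> mask for the number of parent idrefs
    """
    return {
        "author_time_omitted": bool(flags & 128),
        "author_omitted": bool(flags & 64),
        "nr_parents": flags & 63,
    }


def has_extra_headers(inflated_message):
    """Return True when the inflated message starts with a header line.

    A message that starts with LF carries no headers beyond
    tree/parent/author/committer; anything else (for example
    b"encoding ...") means extra headers precede the blank line.
    """
    return not inflated_message.startswith(b"\n")
```

For example, a merge commit made by its own author at commit time
would carry flags 128 | 64 | 2, which decodes to two parents with
both optional fields omitted.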
* Re: Understanding version 4 packs
From: Linus Torvalds @ 2007-03-25 17:09 UTC
To: Shawn O. Pearce; +Cc: Peter Eriksen, Nicolas Pitre, git

On Sun, 25 Mar 2007, Shawn O. Pearce wrote:
> >
> > What happens to the paths that do not have a corresponding entry in the
> > path name table because they are not among the 65535 most frequent
> > paths in the pack?
>
> They don't appear in the table.  And any tree that uses them is
> forced to use the "legacy" OBJ_TREE encoding.  Which is what we
> have now in pack v2, and in loose objects.

Would it hurt too much to just make it four bytes, and avoid that issue?

Special cases - and *especially* special cases that are hard to trigger in
the first place - equal bugs.  And bugs are much much worse than trying to
save a little bit of space.

> The author_time field is not present if flags & 128 is true.
> If flags & 128 is false, it's present, and uses the same encoding
> as commit_time.  Why is this field optional?  Because it's not
> uncommon for it to match commit_time!  ;-)

If the author time is the same as the commit time, most of the time the
author is the same as the committer too, no?  So the field should be
conditional not for the author_time, but for the combination, no?

Our email-parsing tools (which is the most common reason for a committer
not being the same as the author) all take the author date from the
email.  So I don't think author_time == committer_time except when the
committer and the author are one and the same person.

Linus
* Re: Understanding version 4 packs
From: Shawn O. Pearce @ 2007-03-25 20:31 UTC
To: Linus Torvalds; +Cc: Peter Eriksen, Nicolas Pitre, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 25 Mar 2007, Shawn O. Pearce wrote:
> > >
> > > What happens to the paths that do not have a corresponding entry in the
> > > path name table because they are not among the 65535 most frequent
> > > paths in the pack?
> >
> > They don't appear in the table.  And any tree that uses them is
> > forced to use the "legacy" OBJ_TREE encoding.  Which is what we
> > have now in pack v2, and in loose objects.
>
> Would it hurt too much to just make it four bytes, and avoid that issue?
>
> Special cases - and *especially* special cases that are hard to trigger in
> the first place - equal bugs.  And bugs are much much worse than trying to
> save a little bit of space.

Worth exploring.  When I get back to rebasing that topic onto
Junio's tree I'll try a 4 byte index and see what kind of damage
it does on space on large projects (Mozilla, linux-2.6, Eclipse).
You may be right, an 8 byte record may just be worth the cost.

> > The author_time field is not present if flags & 128 is true.
> > If flags & 128 is false, it's present, and uses the same encoding
> > as commit_time.  Why is this field optional?  Because it's not
> > uncommon for it to match commit_time!  ;-)
>
> If the author time is the same as the commit time, most of the time the
> author is the same as the committer too, no?  So the field should be
> conditional not for the author_time, but for the combination, no?

Excellent observation.  I'll make that change at the same time that
I fix the meaning of flags & 128 to mean "more data follows".

Thanks!

--
Shawn.
* Re: Understanding version 4 packs
From: Nicolas Pitre @ 2007-03-26 1:12 UTC
To: Shawn O. Pearce; +Cc: Linus Torvalds, Peter Eriksen, git

On Sun, 25 Mar 2007, Shawn O. Pearce wrote:

> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Sun, 25 Mar 2007, Shawn O. Pearce wrote:
> > > >
> > > > What happens to the paths that do not have a corresponding entry in the
> > > > path name table because they are not among the 65535 most frequent
> > > > paths in the pack?
> > >
> > > They don't appear in the table.  And any tree that uses them is
> > > forced to use the "legacy" OBJ_TREE encoding.  Which is what we
> > > have now in pack v2, and in loose objects.
> >
> > Would it hurt too much to just make it four bytes, and avoid that issue?
> >
> > Special cases - and *especially* special cases that are hard to trigger in
> > the first place - equal bugs.  And bugs are much much worse than trying to
> > save a little bit of space.
>
> Worth exploring.  When I get back to rebasing that topic onto
> Junio's tree I'll try a 4 byte index and see what kind of damage
> it does on space on large projects (Mozilla, linux-2.6, Eclipse).
> You may be right, an 8 byte record may just be worth the cost.

Maybe simply 3 bytes might be a good compromise too.  I doubt a single
pack is ever to contain 4G paths since it is limited to 4G _objects_ in
the first place.

Another approach is to have the path index field width as the first item
in such an object.  This way it can be scaled as needed.

BTW Shawn there is no need to store the number of tree records at the
beginning of the tree object since that can be deduced directly from the
object size stored in the object header.

Nicolas
* Re: Understanding version 4 packs
From: Shawn O. Pearce @ 2007-03-26 2:02 UTC
To: Nicolas Pitre; +Cc: Linus Torvalds, Peter Eriksen, git

Nicolas Pitre <nico@cam.org> wrote:
> Maybe simply 3 bytes might be a good compromise too.  I doubt a single
> pack is ever to contain 4G paths since it is limited to 4G _objects_ in
> the first place.

16M paths is also a lot.  ;-)

> BTW Shawn there is no need to store the number of tree records at the
> beginning of the tree object since that can be deduced directly from the
> object size stored in the object header.

Doh.  Yes, of course.

--
Shawn.
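The trade-off between the 2, 3 and 4 byte path index widths floated
above is plain arithmetic; a small Python sketch makes the numbers
explicit.  The helper names are invented for this illustration, and
the 4-byte SHA-1 index is taken from the draft tree entry layout.

```python
def index_capacity(width_bytes):
    """Number of distinct table slots addressable with this index width."""
    return 2 ** (8 * width_bytes)


def tree_entry_size(path_index_bytes, sha1_index_bytes=4):
    """Bytes per OBJ_DICT_TREE entry for a given path index width."""
    return path_index_bytes + sha1_index_bytes


for width in (2, 3, 4):
    print(width, index_capacity(width), tree_entry_size(width))
```

A 2-byte index addresses 65536 slots and gives 6-byte entries, 3 bytes
give 16777216 slots ("16M") and 7-byte entries, and 4 bytes give
4294967296 slots ("4G") and the 8-byte record mentioned earlier.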
* Re: Understanding version 4 packs
From: Jakub Narebski @ 2007-03-26 8:49 UTC
To: git

Shawn O. Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:

>> BTW Shawn there is no need to store the number of tree records at the
>> beginning of the tree object since that can be deduced directly from the
>> object size stored in the object header.
>
> Doh.  Yes, of course.

But if it makes for easier _implementation_, perhaps it should stay...

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: Understanding version 4 packs
From: Nicolas Pitre @ 2007-03-26 14:01 UTC
To: Jakub Narebski; +Cc: git

On Mon, 26 Mar 2007, Jakub Narebski wrote:

> Shawn O. Pearce wrote:
>
> > Nicolas Pitre <nico@cam.org> wrote:
>
> >> BTW Shawn there is no need to store the number of tree records at the
> >> beginning of the tree object since that can be deduced directly from the
> >> object size stored in the object header.
> >
> > Doh.  Yes, of course.
>
> But if it makes for easier _implementation_, perhaps it should stay...

No.  I don't think a division by 6 is that much of an implementation
issue.

Nicolas
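The "division by 6" can be sketched as follows in Python.  This is a
hypothetical decoder, not the pack4 code: it assumes the body holds
nothing but 6-byte entries (no leading record count) and that both
indexes are big-endian, which the thread does not state explicitly
for tree entries.

```python
import struct

# One entry: 2-byte path table index, 4-byte SHA-1 table index.
# Big-endian is an assumption made for this sketch.
ENTRY = struct.Struct(">HI")


def decode_dict_tree(body):
    """Return a list of (path_index, sha1_index) pairs.

    The entry count is never stored; it is deduced from the object
    size, i.e. len(body) // 6.
    """
    if len(body) % ENTRY.size:
        raise ValueError("body length is not a multiple of 6")
    nr_entries = len(body) // ENTRY.size
    return [ENTRY.unpack_from(body, i * ENTRY.size) for i in range(nr_entries)]
```

The length check is the only extra work the missing count costs: a
body that is not a whole number of entries is rejected up front.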
* Re: Understanding version 4 packs
From: Marco Costalba @ 2007-03-26 12:16 UTC
To: Shawn O. Pearce; +Cc: Peter Eriksen, Nicolas Pitre, git

On 3/25/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Peter Eriksen <s022018@student.dtu.dk> wrote:
> > On Sat, Mar 24, 2007 at 07:24:17PM -0400, Nicolas Pitre wrote:
> > > On Sat, 24 Mar 2007, Peter Eriksen wrote:
> > >
> >
> > The uncompressed file name table contains NR_ENTRIES entries,
> > and looks like this:
> >
> > +------------+------+--------------+------+--------------------+----
> > | NR_ENTRIES | MODE | Full path 1  | MODE | Full path 2        | ...
> > +------------+------+--------------+------+--------------------+----
> >    4 bytes   2 bytes   n1 bytes    2 bytes      n2 bytes
> >
> > MODE is a network-byte-order integer representing the mode of the path,
> > and the path is a variable length, null-terminated string.
>
> Yes so far.
>

Perhaps this has already been evaluated and my comment is not
pertinent, but anyway...

Experimenting with the file names cache in qgit, I found a big saving
by splitting the paths into base name and file name and indexing both:

drivers\usb\host\ehci.h
drivers\usb\host\ehci-pci.c
drivers\usb\host\ohci-pci.c
kernel\sched.c

become:

dir names table

0 drivers\usb\host
1 kernel


file name table

0 ehci.h
1 ehci-pci.c
2 ohci-pci.c

In this way a big saving is achieved for directories deep in the tree
(long paths) that hold a lot of files.  The difference is noticeable
even after compression.

Regarding the MODE field, it is almost always the same, so one idea is
to store a 'default mode' just after nr_entries and add the field only
where the path mode differs from the default mode.  If that leads to
unaligned entries, another idea is to store _all_ mode fields at the
beginning (or at the end) and let deflate remove almost all of the
redundancy more easily.

Marco
* Re: Understanding version 4 packs
From: Nicolas Pitre @ 2007-03-26 14:27 UTC
To: Marco Costalba; +Cc: Shawn O. Pearce, Peter Eriksen, git

On Mon, 26 Mar 2007, Marco Costalba wrote:

> Experimenting with the file names cache in qgit, I found a big saving
> by splitting the paths into base name and file name and indexing both:
>
> drivers\usb\host\ehci.h
> drivers\usb\host\ehci-pci.c
> drivers\usb\host\ohci-pci.c
> kernel\sched.c
>
> become:
>
> dir names table
>
> 0 drivers\usb\host
> 1 kernel
>
>
> file name table
>
> 0 ehci.h
> 1 ehci-pci.c
> 2 ohci-pci.c
>
> In this way a big saving is achieved for directories deep in the tree
> (long paths) that hold a lot of files.

Sure, but if you also consider drivers/usb/Makefile and drivers/Kconfig
for example then you start losing on space saving.  Maybe that makes
sense for qgit but it has no advantage in a pack which contains every
possible file.

> Regarding the MODE field, it is almost always the same, so one idea is
> to store a 'default mode' just after nr_entries and add the field only
> where the path mode differs from the default mode.

If the mode is always the same, or most likely similar for many entries
then it will compress very well.  In fact in the current table format
the three-byte sequence NUL+16-bit-mode will be quite common and likely
to deflate accordingly.  It is therefore not worth adding more complex
handling at runtime for deciding which mode to use, and you'd still
have to store a flag for each path component to decide if the default
mode should be used or not anyway.

> If that leads to unaligned entries, another idea is to store _all_
> mode fields at the beginning (or at the end) and let deflate remove
> almost all of the redundancy more easily.

That's worth trying indeed.

Nicolas
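One way to try the two table layouts discussed above is to deflate
the same entries both ways and compare sizes.  The Python sketch
below is only a measuring aid with made-up helper names and
synthetic input; it makes no claim about which layout wins on a
real repository.

```python
import struct
import zlib


def interleaved(entries):
    """Current draft layout: MODE, path, NUL repeated per entry."""
    return b"".join(
        struct.pack(">H", mode) + path.encode("utf-8") + b"\0"
        for mode, path in entries
    )


def modes_first(entries):
    """Alternative layout: all MODE fields up front, then all paths."""
    modes = b"".join(struct.pack(">H", mode) for mode, _ in entries)
    paths = b"".join(path.encode("utf-8") + b"\0" for _, path in entries)
    return modes + paths


def compare_layouts(entries):
    """Return {layout name: (raw size, deflated size)} for both layouts."""
    result = {}
    for name, layout in (("interleaved", interleaved),
                         ("modes_first", modes_first)):
        raw = layout(entries)
        result[name] = (len(raw), len(zlib.compress(raw)))
    return result
```

Both layouts hold exactly the same bytes in a different order, so
the raw sizes are equal and any difference comes from how well
deflate handles each arrangement.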
* Re: Understanding version 4 packs
From: Marco Costalba @ 2007-03-26 17:10 UTC
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Peter Eriksen, git

On 3/26/07, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 26 Mar 2007, Marco Costalba wrote:
>
> > Experimenting with the file names cache in qgit, I found a big saving
> > by splitting the paths into base name and file name and indexing both:
> >
> > drivers\usb\host\ehci.h
> > drivers\usb\host\ehci-pci.c
> > drivers\usb\host\ohci-pci.c
> > kernel\sched.c
> >
> > become:
> >
> > dir names table
> >
> > 0 drivers\usb\host
> > 1 kernel
> >
> >
> > file name table
> >
> > 0 ehci.h
> > 1 ehci-pci.c
> > 2 ohci-pci.c
> >
> > In this way a big saving is achieved for directories deep in the tree
> > (long paths) that hold a lot of files.
>
> Sure, but if you also consider drivers/usb/Makefile and drivers/Kconfig
> for example then you start losing on space saving.

In your example you'd have:

drivers/usb/Makefile
drivers/Kconfig

become

dir names table

0 drivers
1 drivers/usb

file name table

0 Makefile
1 Kconfig

I fail to see where the loss in space saving is.  Moreover, you
probably have many paths under both 'drivers' and 'drivers/usb', and
for each added path it would be possible to avoid storing the prefix
('drivers' or 'drivers/usb').

To better clarify, OBJ_DICT_TREE data *currently* looks like:

+------------+-------+-------+-------+-------+----
| NR_ENTRIES | name1 | hash1 | name2 | hash2 | ...
+------------+-------+-------+-------+-------+----
     vint     2 bytes 4 bytes 2 bytes 4 bytes

where name1 is an index into the packfile's sole EXTOBJ_FILENAME_TABLE.

The possible improvement is to define OBJ_DICT_TREE like

+------------+------+-------+-------+------+-------+----
| NR_ENTRIES | dir1 | file1 | hash1 | dir2 | file2 | ...
+------------+------+-------+-------+------+-------+----
     vint    2 bytes 2 bytes 4 bytes 2 bytes 2 bytes

where dir1 is an index into a new EXTOBJ_DIRNAME_TABLE and file1 is an
index into a new EXTOBJ_FILENAME_TABLE.

EXTOBJ_FILENAME_TABLE is defined as it is currently (but much smaller
in size!!) and keeps only the file names, not the full paths, while
EXTOBJ_DIRNAME_TABLE is defined like EXTOBJ_FILENAME_TABLE but without
the MODE field (associated with files only) and is used to store the
dir names.

Decoupling dir names from file names could save space because the
length of the proposed EXTOBJ_FILENAME_TABLE + EXTOBJ_DIRNAME_TABLE <
current EXTOBJ_FILENAME_TABLE.

Marco

P.S.: Of course now you'd spend 2+2 bytes in OBJ_DICT_TREE instead of 2
for the 'name' index.  To avoid this, keep the idea of decoupling dir
and file names, and still use 2 bytes in OBJ_DICT_TREE, a possible
layout of EXTOBJ_FILENAME_TABLE could be:

+------------+------+------------+------+------------+------+------+------------+------+----
| NR_ENTRIES | dirA | file name1 | ofs1 | file name2 | ofs2 | dirB | file name3 | ofs3 | ...
+------------+------+------------+------+------------+------+------+------------+------+----

where ofs1 and ofs2 are 2-byte values pointing to dirA, ofs3 points to
dirB and so on, and the tree layout of the above example is:

dirA \ file name1
dirA \ file name2
dirB \ file name3

With this approach you have both the saving in the case of directories
with many files and still 2 bytes per 'name' index in OBJ_DICT_TREE
(which points to the 'file name' field).  This approach saves space as
soon as directory names are longer than 2 chars.
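To make the proposal concrete, here is a minimal Python sketch of
the dir/file split, using forward slashes and invented names.  It
only illustrates the idea as posted; replies in the thread question
whether it matches how tree objects actually store names.

```python
import posixpath


def split_tables(paths):
    """Build a dir name table, a file name table and per-path index pairs.

    Each full path is replaced by (dir_index, file_index); a dir or
    file name is stored once, at the position of its first occurrence.
    """
    dirs, files = [], []
    dir_index, file_index = {}, {}
    entries = []
    for path in paths:
        dirname, filename = posixpath.split(path)
        if dirname not in dir_index:
            dir_index[dirname] = len(dirs)
            dirs.append(dirname)
        if filename not in file_index:
            file_index[filename] = len(files)
            files.append(filename)
        entries.append((dir_index[dirname], file_index[filename]))
    return dirs, files, entries
```

Run on the four example paths from the earlier message, this gives
two dir names and four file names, so the long prefix
drivers/usb/host is stored once instead of three times.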
* Re: Understanding version 4 packs
2007-03-26 17:10 ` Marco Costalba
@ 2007-03-26 18:15 ` Nicolas Pitre
2007-03-26 18:43 ` Nicolas Pitre
1 sibling, 0 replies; 19+ messages in thread
From: Nicolas Pitre @ 2007-03-26 18:15 UTC (permalink / raw)
To: Marco Costalba; +Cc: Shawn O. Pearce, Peter Eriksen, git

On Mon, 26 Mar 2007, Marco Costalba wrote:

> On 3/26/07, Nicolas Pitre <nico@cam.org> wrote:
> > On Mon, 26 Mar 2007, Marco Costalba wrote:
> >
> > > Experimenting with file names cache in qgit I have found a big saving
> > > splitting the paths in base name and file name and indexing both:
> > >
> > > drivers\usb\host\ehci.h
> > > drivers\usb\host\ehci-pci.c
> > > drivers\usb\host\ohci-pci.c
> > > kernel\sched.c
> > >
> > > became:
> > >
> > > dir names table
> > >
> > > 0 drivers\usb\host
> > > 1 kernel
> > >
> > >
> > > file name table
> > >
> > > 0 ehci.h
> > > 1 ehci-pci.c
> > > 2 ohci-pci.c
> > >
> > > In this way a big saving is achieved in case of directories deep in
> > > the tree (long paths) and a lot of files.
> >
> > Sure, but if you also consider drivers/usb/Makefile and drivers/Kconfig
                     ^^^^
> In your example you'd have:
>
> drivers/usb/Makefile
> drivers/Kconfig

No. "also" was the key word here.

Nicolas
* Re: Understanding version 4 packs
2007-03-26 17:10 ` Marco Costalba
2007-03-26 18:15 ` Nicolas Pitre
@ 2007-03-26 18:43 ` Nicolas Pitre
2007-03-27 6:46 ` Marco Costalba
1 sibling, 1 reply; 19+ messages in thread
From: Nicolas Pitre @ 2007-03-26 18:43 UTC (permalink / raw)
To: Marco Costalba; +Cc: Shawn O. Pearce, Peter Eriksen, git

On Mon, 26 Mar 2007, Marco Costalba wrote:

> I fail to see where the loss in space saving is. Moreover, you probably
> have many paths under both 'drivers' and 'drivers/usb', and for each
> added path it would be possible to avoid storing the prefix ('drivers'
> or 'drivers/usb').

I'm under the impression you don't understand how tree objects work.

> To better clarify, OBJ_DICT_TREE data *currently* looks like:
>
> +------------+-------+-------+-------+-------+----
> | NR_ENTRIES | name1 | hash1 | name2 | hash2 | ...
> +------------+-------+-------+-------+-------+----
>     vint      2 bytes 4 bytes 2 bytes 4 bytes
>
> where name1 is an index into the packfile's sole EXTOBJ_FILENAME_TABLE.

Exactly.

> The possible improvement is to define OBJ_DICT_TREE like:
>
> +------------+-------+-------+-------+-------+----
> | NR_ENTRIES | dir1 | file1 | hash1 | dir2 | file2 | ...
> +------------+-------+-------+-------+-------+----
>     vint      2 bytes 2 bytes 2 bytes 4 bytes
>
> where dir1 is an index into a new EXTOBJ_DIRNAME_TABLE and file1 is an
> index into a new EXTOBJ_FILENAME_TABLE.

You definitely don't understand how tree objects are used.

Tree objects have no notion of full path at all. They only contain the
directory components from a single path level.

If you have the following files:

drivers/Kconfig
drivers/usb/Makefile
drivers/usb/host/ehci.h
drivers/usb/host/ehci-pci.c
drivers/usb/host/ohci-pci.c
kernel/sched.c

then you'll start with one tree object for the root directory that
contains:

drivers (tree)
kernel (tree)

Then a second tree object for the "drivers" directory that contains:

Kconfig (blob)
usb (tree)

Then a third tree object for the "usb" directory with:

Makefile (blob)
host (tree)

Then the fourth tree object with:

ehci.h (blob)
ehci-pci.c (blob)
ohci-pci.c (blob)

And finally a fifth tree object for the "kernel" directory with:

sched.c (blob)

Hence, the path component table would contain:

drivers
usb
host
Kconfig
Makefile
ehci.h
ehci-pci.c
ohci-pci.c
sched.c

along with the mode bits for each of those path components, and this is
what the new tree object would index into for each tree record.

> EXTOBJ_FILENAME_TABLE is defined as it is currently (but much smaller in
> size!!) and keeps only the file names, not the full paths, while
> EXTOBJ_DIRNAME_TABLE is defined like EXTOBJ_FILENAME_TABLE but without
> the MODE field (associated with files only) and is used to store the dir
> names.
>
> Decoupling dir names from file names could save space
> because the length of the proposed EXTOBJ_FILENAME_TABLE +
> EXTOBJ_DIRNAME_TABLE < current EXTOBJ_FILENAME_TABLE.

I hope the explanation above made it clear that what you're proposing
cannot ever be smaller than the current EXTOBJ_FILENAME_TABLE.

Nicolas
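A minimal sketch of the decomposition described in the message above, assuming plain Python dictionaries stand in for tree objects and a "tree"/"blob" tag stands in for the mode bits. The names `FILES`, `build_trees` and `component_table` are hypothetical and not from the pack v4 code; real trees are keyed by SHA-1 rather than by directory path, and the plain `sorted()` here is not the `base_name_compare` ordering. Note that the sketch also emits `kernel`, since the root tree needs that component too.

```python
from collections import defaultdict

# The six example files from the message above.
FILES = [
    "drivers/Kconfig",
    "drivers/usb/Makefile",
    "drivers/usb/host/ehci.h",
    "drivers/usb/host/ehci-pci.c",
    "drivers/usb/host/ohci-pci.c",
    "kernel/sched.c",
]


def build_trees(paths):
    """Split full paths into one tree per directory, one path level each.

    Returns {directory path: {component name: "tree" or "blob"}}.
    The directory-path key is an illustration device only.
    """
    trees = defaultdict(dict)
    for path in paths:
        parts = path.split("/")
        parent = ""
        for i, name in enumerate(parts):
            is_last = i == len(parts) - 1
            trees[parent][name] = "blob" if is_last else "tree"
            parent = f"{parent}/{name}" if parent else name
    return dict(trees)


def component_table(trees):
    """Collect each distinct (name, kind) pair referenced by any tree."""
    pairs = {
        (name, kind)
        for entries in trees.values()
        for name, kind in entries.items()
    }
    return sorted(pairs)


TREES = build_trees(FILES)
for directory, entries in TREES.items():
    print(directory or "(root)", entries)
print(component_table(TREES))
```

Each component name is stored once no matter how many trees reference it, which is why splitting directory prefixes out separately has nothing left to save.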
* Re: Understanding version 4 packs
2007-03-26 18:43 ` Nicolas Pitre
@ 2007-03-27 6:46 ` Marco Costalba
2007-03-27 6:55 ` Shawn O. Pearce
0 siblings, 1 reply; 19+ messages in thread
From: Marco Costalba @ 2007-03-27 6:46 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Peter Eriksen, git

On 3/26/07, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 26 Mar 2007, Marco Costalba wrote:
>
> Hence, the path component table would contain:
>
> drivers
> usb
> host
> Kconfig
> Makefile
> ehci.h
> ehci-pci.c
> ohci-pci.c
> sched.c
>
> along with the mode bits for each of those path components, and this is
> what the new tree object would index into for each tree record.
>

Now I understand.

Just a question: does getting full paths require some additional work?

Marco
* Re: Understanding version 4 packs
2007-03-27 6:46 ` Marco Costalba
@ 2007-03-27 6:55 ` Shawn O. Pearce
0 siblings, 0 replies; 19+ messages in thread
From: Shawn O. Pearce @ 2007-03-27 6:55 UTC (permalink / raw)
To: Marco Costalba; +Cc: Nicolas Pitre, Peter Eriksen, git

Marco Costalba <mcostalba@gmail.com> wrote:
> On 3/26/07, Nicolas Pitre <nico@cam.org> wrote:
> >Hence, the path component table would contain:
> >
> > drivers
> > usb
> > host
> > Kconfig
> > Makefile
> > ehci.h
> > ehci-pci.c
> > ohci-pci.c
> > sched.c
> >
> >along with the mode bits for each of those path components, and this is
> >what the new tree object would index into for each tree record.
>
> Just a question: does getting full paths require some additional
> work?

No. Why? Because Git already makes the full path by taking
individual path components from each tree object and joining them
together (adding a "/" between each component) before displaying
it to an application, or loading the path into the index file.
This is because of the fundamental (and quite nice!) structure of
a tree.

--
Shawn.
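The joining step described above can be sketched as a recursive walk, assuming the same toy representation of trees as nested lookups keyed by directory path. `TREES` and `full_paths` are hypothetical names for illustration only; Git itself looks subtrees up by SHA-1.

```python
# Toy stand-in for the five tree objects from the earlier example.
# Keys are directory paths purely for illustration.
TREES = {
    "": {"drivers": "tree", "kernel": "tree"},
    "drivers": {"Kconfig": "blob", "usb": "tree"},
    "drivers/usb": {"Makefile": "blob", "host": "tree"},
    "drivers/usb/host": {
        "ehci.h": "blob",
        "ehci-pci.c": "blob",
        "ohci-pci.c": "blob",
    },
    "kernel": {"sched.c": "blob"},
}


def full_paths(trees, root=""):
    """Yield full blob paths by joining components with "/" on the way down."""
    for name, kind in trees[root].items():
        path = f"{root}/{name}" if root else name
        if kind == "tree":
            # Descend one level; the prefix accumulated so far is carried along.
            yield from full_paths(trees, path)
        else:
            yield path


for path in full_paths(TREES):
    print(path)
```

No tree ever stores more than a single component per entry; the full path only exists while walking.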
* Re: Understanding version 4 packs
2007-03-24 20:23 Understanding version 4 packs Peter Eriksen
2007-03-24 23:24 ` Nicolas Pitre
@ 2007-03-25 8:46 ` Shawn O. Pearce
2007-03-25 9:40 ` Shawn O. Pearce
1 sibling, 1 reply; 19+ messages in thread
From: Shawn O. Pearce @ 2007-03-25 8:46 UTC (permalink / raw)
To: Peter Eriksen; +Cc: git

Peter Eriksen <s022018@student.dtu.dk> wrote:
> I have been reading the commits in the
> git://repo.or.cz/git/fastimport.git/ repository (git makes it quite easy
> to see what differs from mainline using "git log master..pack4"), and I
> think, I have understood some of the details.

Just to be clear, that branch is strictly a proposed prototype of
what a pack version 4 *might* look like. Absolutely nothing has
been set in stone for that file format. A good chunk of that code
needs to be reworked just to get it merged onto Junio's current
'master', as Nico and I have been doing a number of cleanups
and bug fixes in some of the affected areas. ;-)

> The easiest thing to get was the file name table, which is placed in the
> beginning of the pack (after the header) using the format:

That's not true. The filename table (EXTOBJ_FILENAME_TABLE) may
appear at any position within the packfile (but like all objects it
must appear somewhere after position 12, as that is where the header
ends).

Now to help out the unpackers (index-pack and unpack-objects) we have
the convention that this table is written out before the first
OBJ_DICT_TREE. That way the unpacker can load the table and have
it ready to go when it sees the first OBJ_DICT_TREE. If we didn't
have this rule the unpackers would need to hang onto all OBJ_DICT_TREEs
they see until they get the EXTOBJ_FILENAME_TABLE, then they could
actually process those pending OBJ_DICT_TREEs. This is somewhat
expensive on memory, and is just ugly to code.

So what you will find is that the EXTOBJ_FILENAME_TABLE is dumped
out behind all of the commits, but before the first OBJ_DICT_TREE,
and since all trees tend to get converted to an OBJ_DICT_TREE,
the EXTOBJ_FILENAME_TABLE is sandwiched exactly between the commits
and the trees.

Since the unpackers will probably never be smart enough to handle
an OBJ_DICT_TREE before an EXTOBJ_FILENAME_TABLE, it's likely that
we'll just have the file format requirement that the
EXTOBJ_FILENAME_TABLE must appear before the first OBJ_DICT_TREE,
but can otherwise appear at any position in the file.

The reason we put the EXTOBJ_FILENAME_TABLE behind the commits is
we often walk the commit chains (following parent pointers) without
looking at the trees at all. Consider `git log`, in the default
settings we don't need the trees. By keeping the filename table
behind the commits the OS read-ahead buffering gets a better chance
at loading all of the data we need, and none of the data we don't.

So that's why it's where it is.

> +------------+-------------------------------+
> | NR_ENTRIES | Compressed file name table    |
> +------------+-------------------------------+
>    4 bytes

No. A string table object (both the EXTOBJ_FILENAME_TABLE and
the EXTOBJ_IDENT_TABLE) has its uncompressed size stored in the
standard "size" field within the object header. (This lets us
malloc the proper buffer quickly.) Immediately behind that object
header is the deflated table.

The deflated table looks like:

+------------+-------+-------+-------+--------+----+
| NR_ENTRIES | MODE1 | str1  | MODE2 | str2   | ...
+------------+-------+-------+-------+--------+----+
   4 bytes    2 bytes   n1    2 bytes    n2

The field NR_ENTRIES is in big-endian (network) byte order.
Each MODE field is also in big-endian byte order. Each string is
null terminated. The lengths n1 and n2 in the diagram above would
include the null terminating byte.

There is no end-of-table marker; the way to know you have reached the
end of a table is by counting NR_ENTRIES records out.

I did consider making NR_ENTRIES a vint (variable length int),
but decided against it for the sake of simplicity. ;-)

For starters it's much easier to just treat the darn thing as a
32 bit value and use ntohl. It's also easier in the pack-objects
code, as I can reserve that space at the front of the table, as the
size is fixed. Also, since it is actually inside of the deflated
zlib stream, and null bytes are very common (string terminators),
any unnecessary leading nulls will probably compress quite well,
as the null byte will probably get a relatively short encoding in
the compressed stream.

The MODE fields are the standard POSIX mode bits in an
EXTOBJ_FILENAME_TABLE. In an EXTOBJ_IDENT_TABLE the MODE fields
actually store the preferred timezone offset (hours in the
first/high byte, minutes in the second/low byte) of the user whose
name/email is stored in the string field.

> The table is sorted by path then mode for easy binary lookup, and so
> that pointers into this table can be compared directly instead of
> comparing the corresponding paths and modes.

Uh, not quite. In the case of EXTOBJ_FILENAME_TABLE we sort by
name+type using the messy base_name_compare. In this sorting,
string entries whose modes match S_ISDIR (are directory modes) sort
as though their name ends with "/" (even though they actually don't).
If there is a tie, we break the tie by sorting by the mode alone.

In the case of EXTOBJ_IDENT_TABLE we plan to sort by frequency
of occurrence only. This sorting puts the most frequent users at
the start of the table, allowing us to reference the top 128 authors
and committers in just 1 byte, and the next 16,257 top authors and
committers in just 2 bytes (as we use vints to index into here,
more later).
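The string table layout described above (a 4-byte big-endian NR_ENTRIES, then a 2-byte big-endian MODE and a NUL-terminated string per entry, with no end marker) could be read back roughly as follows. This is only a sketch of the prototype format as stated in this message, operating on the already inflated bytes; `build_string_table` and `parse_string_table` are hypothetical helper names, and the object header that precedes the deflated stream is not modelled.

```python
import struct
import zlib


def build_string_table(entries):
    """Serialize [(mode, name_bytes), ...] into the inflated table layout."""
    out = [struct.pack(">I", len(entries))]      # NR_ENTRIES, network order
    for mode, name in entries:
        out.append(struct.pack(">H", mode))      # MODE, network order
        out.append(name + b"\0")                 # NUL-terminated string
    return b"".join(out)


def parse_string_table(data):
    """Parse the inflated table; stops after NR_ENTRIES records."""
    (nr_entries,) = struct.unpack_from(">I", data, 0)
    pos = 4
    entries = []
    for _ in range(nr_entries):
        (mode,) = struct.unpack_from(">H", data, pos)
        pos += 2
        end = data.index(b"\0", pos)             # find the terminating NUL
        entries.append((mode, data[pos:end]))
        pos = end + 1                            # skip past the NUL
    return entries


# Example entries using ordinary POSIX mode bits (they fit in 16 bits).
ENTRIES = [
    (0o040000, b"drivers"),
    (0o100644, b"Makefile"),
    (0o100755, b"configure"),
]

# Only the table body is deflated here, as a stand-in for the pack stream.
deflated = zlib.compress(build_string_table(ENTRIES))
print(parse_string_table(zlib.decompress(deflated)))
```

Because the count sits at a fixed width at the front, a writer can reserve those four bytes up front, which matches the simplicity argument made above.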

> There is a new tree type called OBJ_DICT_TREE, which looks something
> like the following:
>
> +-----------------+------------------------------------------------+----
> | Table offset    | SHA-1 of the blob corresponding to the path.   | ...
> +-----------------+------------------------------------------------+----
>      6 bytes                          20 bytes

No. As Nico stated the records of an OBJ_DICT_TREE are actually
only 6 bytes each.

Actually an OBJ_DICT_TREE is *not* compressed in the packfile.
I want to stress this point, as it's unlike most other object types
where the data after the header is just a zlib stream. Its data
looks like:

+------------+-------+-------+-------+-------+----
| NR_ENTRIES | name1 | hash1 | name2 | hash2 | ...
+------------+-------+-------+-------+-------+----
    vint      2 bytes 4 bytes 2 bytes 4 bytes

The NR_ENTRIES field is our "standard" variable integer encoding
(the encoding used by OBJ_OFS_DELTA). It tells us how many tree
entries to expect.

name1 is an index into the packfile's sole EXTOBJ_FILENAME_TABLE.

hash1 is an index into the packfile's sole SHA1 table. This object
type hasn't been declared yet, but will be.

Both fields are in big-endian / network byte order.

> These new tree objects will remain uncompressed in the pack file, but
> sorted with,

Yes, correct.

> and deltaed against other tree objects.

*only* against other OBJ_DICT_TREEs. If a tree could not be
converted to an OBJ_DICT_TREE then it stays as an OBJ_TREE and only
deltas against other OBJ_TREEs.

> All normal tree
> objects are converted to OBJ_DICT_TREE when packing,

Almost. We try to convert all trees to OBJ_DICT_TREE when packing,
but we cannot do so if the EXTOBJ_FILENAME_TABLE does not contain
one or more path/mode pairs required by that tree. This can happen
if the EXTOBJ_FILENAME_TABLE would need to contain more than 2**16
entries, as the index into that table (name1 above) is strictly a
16 bit unsigned value.
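A sketch of reading the OBJ_DICT_TREE body described above: a vint NR_ENTRIES followed by fixed 6-byte records (a 16-bit name index and a 32-bit hash index, both big-endian). The vint routines follow the offset encoding Git uses for OBJ_OFS_DELTA, as the message says; `write_vint`, `read_vint` and `parse_dict_tree` are hypothetical names, and the object header in front of the body is not modelled.

```python
import struct


def write_vint(value):
    """Encode a non-negative int with the OBJ_OFS_DELTA offset encoding."""
    out = [value & 0x7F]                 # last byte: no continuation bit
    value >>= 7
    while value:
        value -= 1                       # the "+1" bias of this encoding
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(out))          # most significant group first


def read_vint(buf, pos=0):
    """Decode a vint at buf[pos]; returns (value, next position)."""
    c = buf[pos]
    pos += 1
    value = c & 0x7F
    while c & 0x80:
        c = buf[pos]
        pos += 1
        value = ((value + 1) << 7) | (c & 0x7F)
    return value, pos


def parse_dict_tree(body):
    """Return [(name_index, hash_index), ...] from an uncompressed body."""
    nr_entries, pos = read_vint(body)
    records = []
    for _ in range(nr_entries):
        # ">HI": 2-byte name index + 4-byte hash index, no padding.
        records.append(struct.unpack_from(">HI", body, pos))
        pos += 6
    return records


body = write_vint(2) + struct.pack(">HI", 3, 70000) + struct.pack(">HI", 65535, 1)
print(parse_dict_tree(body))
```

Values 0 through 127 fit in a single vint byte, and the fixed record width means entry `i` can be located at a constant offset once NR_ENTRIES has been read.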
Thus we have a rule in pack-objects where we first sort the
EXTOBJ_FILENAME_TABLE by frequency, clipping it to the top 2**16
entries, then we re-sort it according to the name+mode sort.

> and are converted
> back on the fly to callers who need an ordinary OBJ_TREE.

Yes. But we don't want to actually do that. One of our goals is
to adjust tree-walk.c (and if needed its callers) to directly handle
an OBJ_DICT_TREE. This way we can avoid a lot of costly decompression.

Further I think we can play a game with the delta encoder and delta
apply routines where we can even avoid applying OBJ_DICT_TREE deltas
when we are walking the tree; instead we can walk the deltas directly.

This is the primary motivation for keeping the OBJ_DICT_TREE format
a fixed width record, even if it might waste a tiny amount of space
for some projects.

> The index (.idx) files are extended to have a 4 byte pointer to the
> offset of this file name table in the pack file for easy lookup.

Yes. But these may become 64 bit offsets, to allow for very large
packfiles.

> There is something similar with a table of common strings in commit
> objects (e.g. author and timezone), and a new object OBJ_DICT_COMMIT,
> but I have not understood that quite yet.

It's actually EXTOBJ_DICT_COMMIT. The idea here is that author and
committer strings appear very commonly throughout a project. Look at
Junio for example in git.git, there are more than 3,000 commits
with his name on them. These compress rather poorly, and don't
delta against each other very well at all. By pulling these common
strings out to an EXTOBJ_IDENT_TABLE we can save some space.

The other idea is to store the tree and the parent commits in
pure binary (so SHA-1s are 20 bytes, not 40 bytes hex) and to avoid
text headers, so that we can parse the important fields of a commit
that are needed for revision walking immediately from the raw pack
data. Since SHA-1s are incompressible we aren't actually losing any
disk space here either.
Actually in my early experiments (which predate the packv4 code you
looked at) this was saving about 63 bytes per commit.

> Is there something, I have gotten wrong with regards to my
> understanding?

You're close. Not bad for no documentation! ;-)

--
Shawn.
* Re: Understanding version 4 packs
2007-03-25 8:46 ` Shawn O. Pearce
@ 2007-03-25 9:40 ` Shawn O. Pearce
0 siblings, 0 replies; 19+ messages in thread
From: Shawn O. Pearce @ 2007-03-25 9:40 UTC (permalink / raw)
To: Peter Eriksen; +Cc: git

"Shawn O. Pearce" <spearce@spearce.org> wrote:
> So what you will find is that the EXTOBJ_FILENAME_TABLE is dumped
> out behind all of the commits, but before the first OBJ_DICT_TREE,
> and since all trees tend to get converted to an OBJ_DICT_TREE,
> the EXTOBJ_FILENAME_TABLE is sandwiched exactly between the commits
> and the trees.
...
> The reason we put the EXTOBJ_FILENAME_TABLE behind the commits is
> we often walk the commit chains (following parent pointers) without
> looking at the trees at all. Consider `git log`, in the default
> settings we don't need the trees. By keeping the filename table
> behind the commits the OS read-ahead buffering gets a better chance
> at loading all of the data we need, and none of the data we don't.
>
> So that's why it's where it is.

I just talked with Junio about this on #git.

My real reason for putting the EXTOBJ_FILENAME_TABLE here is "lack
of a better reason". I just didn't write that above. ;-)

We want it before the first OBJ_DICT_TREE to help the unpackers.
And just like we don't currently ever store the delta base for an
OBJ_TREE before the first commit (as commits always get packed first)
we also don't store the EXTOBJ_FILENAME_TABLE before the first commit.

Junio raised the point that in large projects `git log -- asm/i386`
can be a very common/useful/necessary operation, and that in such
cases we need to evaluate trees as part of the log operation.
Any attempt to optimize for git-log without a path spec is wrong,
wrong, wrong. I agree.

The part I quoted above was not trying to imply that Nico and I are
optimizing for using git-log without a path limiter. It just read
that way to Junio, and may read that way for others too. Hence this
follow-up.

I'm open to suggestions about placement for EXTOBJ_FILENAME_TABLE,
but I think its current position between commits and trees is
probably the best we can get.

--
Shawn.