Packfile can't be mapped


* Packfile can't be mapped
@ 2006-08-28  1:04 Jon Smirl
  2006-08-28  2:47 ` Shawn Pearce
  0 siblings, 1 reply; 16+ messages in thread
From: Jon Smirl @ 2006-08-28  1:04 UTC (permalink / raw)
  To: git, Shawn Pearce

git-repack can't handle my 1.75GB pack file. I am running x86 with 3GB
address space.

-rw-rw-r-- 1 jonsmirl jonsmirl    47221712 Aug 27 20:29 testme.idx
-rw-rw-r-- 1 jonsmirl jonsmirl  1754317619 Aug 27 20:29 testme.pack

[jonsmirl@jonsmirl t1]$ git-repack -a -f --window=50 --depth=5000
Generating pack...
Done counting 1963325 objects.
fatal: packfile .git/objects/pack/testme.pack cannot be mapped.
[jonsmirl@jonsmirl t1]$

It is built from Mozilla CVS but it is an intermediate stage of our
work. The fast-import tool isn't diffing directory trees, which makes
the pack much bigger than it needs to be. Shawn is working on the
packing code.

---------------------------------------------------
Alloc'd objects:    1968000 (   1892000 overflow  )
Total objects:      1967527 (     41856 duplicates)
      blobs  :       633842 (         0 duplicates)
      trees  :      1131208 (     41856 duplicates)
      commits:       200921 (         0 duplicates)
      tags   :         1556 (         0 duplicates)
Total branches:        1600 (      7985 loads     )
      marks:        1048576 (    200921 unique    )
      atoms:          56803
Memory total:         66908 KiB
       pools:          5408 KiB
     objects:         61500 KiB
Pack remaps:           9501
---------------------------------------------------
Pack size:          1713200 KiB
Index size:           46114 KiB


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Packfile can't be mapped
  2006-08-28  1:04 Packfile can't be mapped Jon Smirl
@ 2006-08-28  2:47 ` Shawn Pearce
  2006-08-28  4:27   ` Nicolas Pitre
  2006-08-29  4:52   ` Shawn Pearce
  0 siblings, 2 replies; 16+ messages in thread
From: Shawn Pearce @ 2006-08-28  2:47 UTC (permalink / raw)
  To: git; +Cc: Jon Smirl

Jon Smirl <jonsmirl@gmail.com> wrote:
> git-repack can't handle my 1.75GB pack file. I am running x86 with 3GB
> address space.
> 
> -rw-rw-r-- 1 jonsmirl jonsmirl    47221712 Aug 27 20:29 testme.idx
> -rw-rw-r-- 1 jonsmirl jonsmirl  1754317619 Aug 27 20:29 testme.pack
> 
> [jonsmirl@jonsmirl t1]$ git-repack -a -f --window=50 --depth=5000
> Generating pack...
> Done counting 1963325 objects.
> fatal: packfile .git/objects/pack/testme.pack cannot be mapped.
> [jonsmirl@jonsmirl t1]$
> 
> It is built from Mozilla CVS but it is an intermediate stage of our
> work. The fast-import tool isn't diffing directory tree which makes
> the pack much bigger than it needs to be. Shawn is working on the
> packing code.

I'm going to try to get tree deltas written to the pack sometime this
week. That should compact this intermediate pack down to something
that git-pack-objects would be able to successfully mmap into a
32 bit address space.  A complete repack with no delta reuse will
hopefully generate a pack closer to 400 MB in size.  But I know
Jon would like to get that pack even smaller.  :)

I should point out that the input stream to fast-import was 20 GB
(completely decompressed revisions from RCS) plus all commit data.
The original CVS ,v files are around 3 GB.  An archive .tar.gz'ing
the ,v files is around 550 MB.  Going to only 1.7 GB without tree
or commit deltas is certainly pretty good.  :)

> ---------------------------------------------------
> Alloc'd objects:    1968000 (   1892000 overflow  )
> Total objects:      1967527 (     41856 duplicates)
>       blobs  :       633842 (         0 duplicates)
>       trees  :      1131208 (     41856 duplicates)
>       commits:       200921 (         0 duplicates)
>       tags   :         1556 (         0 duplicates)
> Total branches:        1600 (      7985 loads     )
>       marks:        1048576 (    200921 unique    )
>       atoms:          56803
> Memory total:         66908 KiB
>        pools:          5408 KiB
>      objects:         61500 KiB
> Pack remaps:           9501
> ---------------------------------------------------
> Pack size:          1713200 KiB
> Index size:           46114 KiB

All of that says that aside from the 1.7 GB output file fast-import
ran extremely well.  About 1.9 million objects were written into
the output pack file, with 41k duplicate trees (duplicate blobs
were removed by cvs2svn prior to fast-import so they don't appear).
200k commits were created across 1600 branches.  And we did it in
only 67 MB of memory.

We also had ~8000 LRU cache misses related to our branch data;
this just means that cvs2svn likes to frequently jump around
between branches rather than import an entire branch at a time.
Boosting the size of the LRU cache (at the expense of needing more
memory) should reduce those cache misses as well as 'Pack remaps'.

I'd also like to clean up that pack remapping code and move it
into sha1_file.c.  It's an implementation of partial pack mapping
and it is apparently working quite well for us in fast-import.
It may help GIT deal with very large packs (e.g. 1.7 GB) on smaller
address space systems (e.g. 32 bit).
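
A minimal sketch of what partial pack mapping could look like, assuming a
fixed-size sliding window over the file.  This is not the fast-import code;
the class name `PackWindow`, the default window size, and the `remaps`
counter are hypothetical and only illustrate the idea of never mapping the
whole pack at once.

```python
import mmap
import os


class PackWindow:
    """Read-only access to a large file, mapping one window at a time."""

    def __init__(self, path, window_size=32 * 1024 * 1024):
        gran = mmap.ALLOCATIONGRANULARITY
        # mmap offsets must be multiples of the allocation granularity,
        # so round the (hypothetical) window size up to that boundary.
        self.window_size = max(gran, (window_size + gran - 1) // gran * gran)
        self.file = open(path, "rb")
        self.file_size = os.fstat(self.file.fileno()).st_size
        self.map = None
        self.map_start = 0
        self.remaps = 0  # comparable in spirit to the "Pack remaps" counter

    def _remap(self, offset):
        # Drop the old window and map the one that contains `offset`.
        if self.map is not None:
            self.map.close()
        self.map_start = offset - offset % self.window_size
        length = min(self.window_size, self.file_size - self.map_start)
        self.map = mmap.mmap(self.file.fileno(), length,
                             access=mmap.ACCESS_READ, offset=self.map_start)
        self.remaps += 1

    def read(self, offset, size):
        # Copy out `size` bytes, remapping whenever we leave the window.
        out = bytearray()
        end = min(offset + size, self.file_size)
        while offset < end:
            in_window = (self.map is not None and
                         self.map_start <= offset < self.map_start + len(self.map))
            if not in_window:
                self._remap(offset)
            rel = offset - self.map_start
            chunk = self.map[rel:rel + (end - offset)]
            out += chunk
            offset += len(chunk)
        return bytes(out)

    def close(self):
        if self.map is not None:
            self.map.close()
        self.file.close()
```

With a scheme like this the address space needed is bounded by the window
size rather than the pack size, at the cost of a remap each time a read
falls outside the current window.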

We're not confident that this import is completely valid yet.
We have a few translation issues we're still working on.  But now
that we have a complete pack going from start to finish we can start
to focus on those issues.  Especially since this entire process
(,v to .pack) takes less than half a day to run.

-- 
Shawn.


* Re: Packfile can't be mapped
  2006-08-28  2:47 ` Shawn Pearce
@ 2006-08-28  4:27   ` Nicolas Pitre
  2006-08-28  4:36     ` Linus Torvalds
                       ` (2 more replies)
  2006-08-29  4:52   ` Shawn Pearce
  1 sibling, 3 replies; 16+ messages in thread
From: Nicolas Pitre @ 2006-08-28  4:27 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jon Smirl

On Sun, 27 Aug 2006, Shawn Pearce wrote:

> I'm going to try to get tree deltas written to the pack sometime this
> week. That should compact this intermediate pack down to something
> that git-pack-objects would be able to successfully mmap into a
> 32 bit address space.  A complete repack with no delta reuse will
> hopefully generate a pack closer to 400 MB in size.  But I know
> Jon would like to get that pack even smaller.  :)

One thing to consider in your code (if you didn't implement that 
already) is to _not_ attempt any delta on any object whose size is 
smaller than 50 bytes, and then limit the maximum delta size to 
object_size/2 - 20 (use that for the last argument to diff-delta() and 
store the undeltified object when diff-delta returns NULL).  This way 
you'll avoid creating delta objects that are most likely to end up being 
_larger_ than the undeltified object.
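
The rule above can be sketched as follows.  This is only an illustration of
the two thresholds named in the message (50 bytes and object_size/2 - 20);
the function names and the "full"/"delta" labels are hypothetical, and the
delta size is passed in rather than computed by a real diff-delta call.

```python
MIN_DELTA_OBJECT_SIZE = 50   # objects smaller than this are never deltified
BASE_REF_OVERHEAD = 20       # bytes needed to name the delta's base object


def max_delta_size(object_size):
    """Largest delta worth keeping, or None if no delta should be tried."""
    if object_size < MIN_DELTA_OBJECT_SIZE:
        return None
    return object_size // 2 - BASE_REF_OVERHEAD


def choose_storage(object_size, delta_size):
    """Return "delta" or "full"; delta_size is None when no delta was produced."""
    limit = max_delta_size(object_size)
    if limit is None or delta_size is None or delta_size > limit:
        return "full"
    return "delta"
```

For a 1000-byte object the limit works out to 480 bytes, so a 481-byte
delta would be rejected and the object stored undeltified.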

> I should point out that the input stream to fast-import was 20 GB
> (completely decompressed revisions from RCS) plus all commit data.
> The original CVS ,v files are around 3 GB.  An archive .tar.gz'ing
> the ,v files is around 550 MB.  Going to only 1.7 GB without tree
> or commit deltas is certainly pretty good.  :)

Good job indeed.  Oh and you probably should not bother trying to 
deltify commit objects at all since that would be a waste of time.

Nicolas


* Re: Packfile can't be mapped
  2006-08-28  4:27   ` Nicolas Pitre
@ 2006-08-28  4:36     ` Linus Torvalds
  2006-08-28  6:00       ` Shawn Pearce
  2006-08-28 14:48       ` Nicolas Pitre
  2006-08-28  5:33     ` Shawn Pearce
  2006-08-28 16:42     ` Shawn Pearce
  2 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2006-08-28  4:36 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn Pearce, git, Jon Smirl

On Mon, 28 Aug 2006, Nicolas Pitre wrote:
> 
> Good job indeed.  Oh and you probably should not bother trying to 
> deltify commit objects at all since that would be a waste of time.

It might not necessarily always be a waste of time. Especially if you have 
multiple branches tracking a "maintenance" branch, you often end up having 
the same commit message repeated several times in "unrelated" commits 
(they're really the same commit, applied to another branch).

Also, I could imagine that some automated system generates very verbose 
(and possibly very regular) commit messages, so under certain 
circumstances it may well make sense to see if the commits might delta 
against each other.

But I'll agree that in normal use it's not likely to be a huge saving, 
though. It's probably not worth doing for the fast importer unless it just 
happens to fall out of the code very easily.

		Linus


* Re: Packfile can't be mapped
  2006-08-28  4:27   ` Nicolas Pitre
  2006-08-28  4:36     ` Linus Torvalds
@ 2006-08-28  5:33     ` Shawn Pearce
  2006-08-28 16:42     ` Shawn Pearce
  2 siblings, 0 replies; 16+ messages in thread
From: Shawn Pearce @ 2006-08-28  5:33 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git, Jon Smirl

Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 27 Aug 2006, Shawn Pearce wrote:
> 
> > I'm going to try to get tree deltas written to the pack sometime this
> > week. That should compact this intermediate pack down to something
> > that git-pack-objects would be able to successfully mmap into a
> > 32 bit address space.  A complete repack with no delta reuse will
> > hopefully generate a pack closer to 400 MB in size.  But I know
> > Jon would like to get that pack even smaller.  :)
> 
> One thing to consider in your code (if you didn't implement that 
> already) is to _not_ attempt any delta on any object whose size is 
> smaller than 50 bytes, and then limit the maximum delta size to 
> object_size/2 - 20 (use that for the last argument to diff-delta() and 
> store the undeltified object when diff-delta returns NULL).  This way 
> you'll avoid creating delta objects that are most likely to end up being 
> _larger_ than the undeltified object.

I haven't tried this.  Should be trivial to implement.  Thanks for
the suggestion.

> > I should point out that the input stream to fast-import was 20 GB
> > (completely decompressed revisions from RCS) plus all commit data.
> > The original CVS ,v files are around 3 GB.  An archive .tar.gz'ing
> > the ,v files is around 550 MB.  Going to only 1.7 GB without tree
> > or commit deltas is certainly pretty good.  :)
> 
> Good job indeed.  Oh and you probably should not bother trying to 
> deltify commit objects at all since that would be a waste of time.

I wasn't going to bother even trying to delta the commits.  In this
import the 200k commits aren't a very large percentage of the data.
As I'm sure you are well aware, it's pretty much a waste of time to try
with the commits, especially with an "intermediate" pack such as
this one.

-- 
Shawn.


* Re: Packfile can't be mapped
  2006-08-28  4:36     ` Linus Torvalds
@ 2006-08-28  6:00       ` Shawn Pearce
  2006-08-28 14:15         ` Jon Smirl
  2006-08-28 14:40         ` Nicolas Pitre
  2006-08-28 14:48       ` Nicolas Pitre
  1 sibling, 2 replies; 16+ messages in thread
From: Shawn Pearce @ 2006-08-28  6:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, git, Jon Smirl

Linus Torvalds <torvalds@osdl.org> wrote:
> 
> On Mon, 28 Aug 2006, Nicolas Pitre wrote:
> > 
> > Good job indeed.  Oh and you probably should not bother trying to 
> > deltify commit objects at all since that would be a waste of time.
> 
> It might not necessarily always be a waste of time. Especially if you have 
> multiple branches tracking a "maintenance" branch, you often end up having 
> the same commit message repeated several times in "unrelated" commits 
> (they're really the same commit, applied to another branch).
> 
> Also, I could imagine that some automated system generates very verbose 
> (and possibly very regular) commit messages, so under certain 
> circumstances it may well make sense to see if the commits might delta 
> against each other.
> 
> But I'll agree that in normal use it's not likely to be a huge saving, 
> though. It's probably not worth doing for the fast importer unless it just 
> happens to fall out of the code very easily.

Does git-pack-objects attempt to delta commits against each other?

I've been thinking about applying a pack-local but zlib-stream
global dictionary.  If we added three global dictionaries to the
front of the pack file, one for commits, one for trees and one
for blobs, and use those as the global dictionaries for the zlib
streams stored within that pack we could probably get a good space
savings for trees and commits.

I'd suspect that for many projects the commit global dictionary
would contain the common required strings such as:

  'tree ', 'parent ', 'committer ', 'author ', 'Signed-off-by: '

plus the top author/committer name/email combination strings.
For GIT I'd expect 'Junio C Hamano <junkio@cox.net>' to be way up
there in terms of frequency within commit objects.  Finding the most
common authors and committer strings would be trivial, as would
finding the most common 'footer' strings such as 'Signed-off-by: '
and 'Acked-by: '.
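
A small sketch of the preset-dictionary idea using Python's zlib bindings.
The dictionary contents, the helper names, and the sample commit (including
its all-zero object ids) are made up for illustration; only the strings
themselves come from the list above.  zlib's documentation suggests placing
the most common strings towards the end of the dictionary, which is why the
header keywords come last here.

```python
import zlib

# Hypothetical commit dictionary: less common strings first, most common last.
COMMIT_DICT = (
    b"Acked-by: Signed-off-by: "
    b"Junio C Hamano <junkio@cox.net>"
    b"committer author parent tree "
)


def deflate_with_dict(data, zdict):
    """Deflate `data` against a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()


def inflate_with_dict(data, zdict):
    """Inflate a stream that was deflated against the same dictionary."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(data) + d.flush()


# Made-up commit object body, for demonstration only.
sample_commit = (
    b"tree " + b"0" * 40 + b"\n"
    b"parent " + b"0" * 40 + b"\n"
    b"author Junio C Hamano <junkio@cox.net> 1156700000 -0700\n"
    b"committer Junio C Hamano <junkio@cox.net> 1156700000 -0700\n"
    b"\n"
    b"Example message\n"
    b"\n"
    b"Signed-off-by: Junio C Hamano <junkio@cox.net>\n"
)

packed = deflate_with_dict(sample_commit, COMMIT_DICT)
restored = inflate_with_dict(packed, COMMIT_DICT)
```

A stream produced this way cannot be inflated without the same dictionary,
which is the reuse problem raised a few paragraphs further down.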

I think the same is true of trees, with '100644 ', '100755 ', '40000 '
being way up there, but also file names that commonly appear within
trees, e.g. "Makefile.in\0".

Blobs would be more difficult to generate a reasonable global
dictionary for.  But for some projects a crude estimated dictionary
can shave off at least 4% of pack size (true in both GIT and Mozilla
sources it seems).

Of course the major problem with pack-local, stream global
dictionaries is it voids the ability to reuse that zlib'd content
from that pack in another pack without wholesale copying the
dictionary as well.  This is an issue for servers which want to
copy out the pack entry without recompressing it but also want the
storage savings from the global dictionaries.

But then again, if we just delta against a commit which uses the
same author and committer, or against the same tree but different
version then there should be a lot of delta copying from the base...
which easily allows entry reuse and should provide similar space
savings, providing the delta depth is deep enough (or the delta graph
is wide enough) to minimize the number of base objects containing
repeated occurrences of the common strings.

-- 
Shawn.


* Re: Packfile can't be mapped
  2006-08-28  6:00       ` Shawn Pearce
@ 2006-08-28 14:15         ` Jon Smirl
  2006-08-28 14:40         ` Nicolas Pitre
  1 sibling, 0 replies; 16+ messages in thread
From: Jon Smirl @ 2006-08-28 14:15 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Linus Torvalds, Nicolas Pitre, git

On 8/28/06, Shawn Pearce <spearce@spearce.org> wrote:
> Of course the major problem with pack-local, stream global
> dictionaries is it voids the ability to reuse that zlib'd content
> from that pack in another pack without wholesale copying the
> dictionary as well.  This is an issue for servers which want to
> copy out the pack entry without recompressing it but also want the
> storage savings from the global dictionaries.

If you copy an entire pack with a dictionary embedded in it everything
is fine. But if you pull objects out of the pack for transmission it
looks like they will need to be unpacked and repacked without the
dictionary. My plan was to only use dictionaries in archival packs
that would be used in a read only manner and copied around whole.

So there would need to be something like an archive flag on git-repack
which would spend extra CPU time trying to make the pack as small as
possible. Normal use of git-repack wouldn't touch packs marked with
the archive flag since they were already in optimal form.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Packfile can't be mapped
  2006-08-28  6:00       ` Shawn Pearce
  2006-08-28 14:15         ` Jon Smirl
@ 2006-08-28 14:40         ` Nicolas Pitre
  2006-08-28 15:44           ` Jon Smirl
  2006-08-28 16:48           ` Shawn Pearce
  1 sibling, 2 replies; 16+ messages in thread
From: Nicolas Pitre @ 2006-08-28 14:40 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Linus Torvalds, git, Jon Smirl

On Mon, 28 Aug 2006, Shawn Pearce wrote:

> Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > On Mon, 28 Aug 2006, Nicolas Pitre wrote:
> > > 
> > > Good job indeed.  Oh and you probably should not bother trying to 
> > > deltify commit objects at all since that would be a waste of time.
> > 
> > It might not necessarily always be a waste of time. Especially if you have 
> > multiple branches tracking a "maintenance" branch, you often end up having 
> > the same commit message repeated several times in "unrelated" commits 
> > (they're really the same commit, applied to another branch).
> > 
> > Also, I could imagine that some automated system generates very verbose 
> > (and possibly very regular) commit messages, so under certain 
> > circumstances it may well make sense to see if the commits might delta 
> > against each other.
> > 
> > But I'll agree that in normal use it's not likely to be a huge saving, 
> > though. It's probably not worth doing for the fast importer unless it just 
> > happens to fall out of the code very easily.
> 
> Does git-pack-objects attempt to delta commits against each other?

Yes it does.  But simply because it requires no special case in the code 
for that and it can afford spending more time trying to tighten a pack 
given that it wastes a lot of cycles with delta windows already.  In the 
context of an intermediate pack from fast-import I doubt it is worth it 
since the extra time spent on that won't give you a significant size 
reduction.

> I've been thinking about applying a pack-local but zlib-stream
> global dictionary.  If we added three global dictionaries to the
> front of the pack file, one for commits, one for trees and one
> for blobs, and use those as the global dictionaries for the zlib
> streams stored within that pack we could probably get a good space
> savings for trees and commits.
> 
> I'd suspect that for many projects the commit global dictionary
> would contain the common required strings such as:
> 
>   'tree ', 'parent ', 'committer ', 'author ', 'Signed-off-by: '
> 
> plus the top author/committer name/email combination strings.
> For GIT I'd expect 'Junio C Hamano <junkio@cox.net>' to be way up
> there in terms of frequency within commit objects.  Finding the most
> common authors and committer strings would be trivial, as would
> finding the most common 'footer' strings such as 'Signed-off-by: '
> and 'Acked-by: '.
> 
> I think the same is true of trees, with '100644 ', '100755 ', '40000 '
> being way up there, but also file names that commonly appear within
> trees, e.g. "Makefile.in\0".

Indeed.

> Blobs would be more difficult to generate a reasonable global
> dictionary for.  But for some projects a crude estimated dictionary
> can shave off at least 4% of pack size (true in both GIT and Mozilla
> sources it seems).
> 
> 
> Of course the major problem with pack-local, stream global
> dictionaries is it voids the ability to reuse that zlib'd content
> from that pack in another pack without wholesale copying the
> dictionary as well.  This is an issue for servers which want to
> copy out the pack entry without recompressing it but also want the
> storage savings from the global dictionaries.

Why would copying the dictionary as well be a problem?  How large might 
it be?  Can it be stored deflated itself?

> But then again, if we just delta against a commit which uses the
> same author and committer, or against the same tree but different
> version then there should be a lot of delta copying from the base...
> which easily allows entry reuse and should provide similar space
> savings, providing the delta depth is deep enough (or the delta graph
> is wide enough) to minimize the number of base objects containing
> repeated occurrences of the common strings.

This is already attempted by pack-objects and used when it provides a 
gain.  It is just much less likely to happen than with trees or blobs.


Nicolas


* Re: Packfile can't be mapped
  2006-08-28  4:36     ` Linus Torvalds
  2006-08-28  6:00       ` Shawn Pearce
@ 2006-08-28 14:48       ` Nicolas Pitre
  1 sibling, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2006-08-28 14:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn Pearce, git, Jon Smirl

On Sun, 27 Aug 2006, Linus Torvalds wrote:

> 
> 
> On Mon, 28 Aug 2006, Nicolas Pitre wrote:
> > 
> > Good job indeed.  Oh and you probably should not bother trying to 
> > deltify commit objects at all since that would be a waste of time.
> 
> It might not necessarily always be a waste of time.
[...]
> It's probably not worth doing for the fast importer unless it just 
> happens to fall out of the code very easily.

... which is what my comment was all about.


Nicolas


* Re: Packfile can't be mapped
  2006-08-28 14:40         ` Nicolas Pitre
@ 2006-08-28 15:44           ` Jon Smirl
  2006-08-28 16:43             ` Nicolas Pitre
  2006-08-28 16:48           ` Shawn Pearce
  1 sibling, 1 reply; 16+ messages in thread
From: Jon Smirl @ 2006-08-28 15:44 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn Pearce, Linus Torvalds, git

On 8/28/06, Nicolas Pitre <nico@cam.org> wrote:
> > Of course the major problem with pack-local, stream global
> > dictionaries is it voids the ability to reuse that zlib'd content
> > from that pack in another pack without wholesale copying the
> > dictionary as well.  This is an issue for servers which want to
> > copy out the pack entry without recompressing it but also want the
> > storage savings from the global dictionaries.
>
> Why would copying the dictionary as well be a problem?  How large might
> it be?  Can it be stored deflated itself?

The dictionaries are only 4KB. But they have to match up with the
objects compressed using them. If you bring an object straight down
out of a dictionary based pack and make it standalone you won't be
able to read it. You need the dictionary to decode it. What if the
local and remote packs have been packed using two different
dictionaries? You can't directly move objects between them.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Packfile can't be mapped
  2006-08-28  4:27   ` Nicolas Pitre
  2006-08-28  4:36     ` Linus Torvalds
  2006-08-28  5:33     ` Shawn Pearce
@ 2006-08-28 16:42     ` Shawn Pearce
  2006-08-28 17:19       ` Nicolas Pitre
  2 siblings, 1 reply; 16+ messages in thread
From: Shawn Pearce @ 2006-08-28 16:42 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git, Jon Smirl

Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 27 Aug 2006, Shawn Pearce wrote:
> 
> > I'm going to try to get tree deltas written to the pack sometime this
> > week. That should compact this intermediate pack down to something
> > that git-pack-objects would be able to successfully mmap into a
> > 32 bit address space.  A complete repack with no delta reuse will
> > hopefully generate a pack closer to 400 MB in size.  But I know
> > Jon would like to get that pack even smaller.  :)
> 
> One thing to consider in your code (if you didn't implement that 
> already) is to _not_ attempt any delta on any object whose size is 
> smaller than 50 bytes, and then limit the maximum delta size to 
> object_size/2 - 20 (use that for the last argument to diff-delta() and 
> store the undeltified object when diff-delta returns NULL).  This way 
> you'll avoid creating delta objects that are most likely to end up being 
> _larger_ than the undeltified object.

So I added Nico's suggestions to fast-import and ran it on a small
subset of the Mozilla repository (3424 blobs):

  naive always delta: 6652 KiB
  Nico's suggestion:  6842 KiB

So Nico's suggestion of limiting delta size to (orig_len/2)-20 or
not using deltas on blobs < 50 bytes actually added 190 KB to the
output pack.  Since this sample is probably fairly representative
of the rest of the repository's blobs I'm thinking we may see a 2.8%
increase in size over the current 930 MB blob pack.  That's another
26 MB in our intermediate pack.  I don't think this suggestion is
really worth including in fast-import right now...
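
The estimate above can be reproduced with a few lines of arithmetic.  The
three input numbers are the ones quoted in this message; the variable names
are just for illustration, and the projection assumes (as the message does)
that the sample is representative of the whole blob pack.

```python
naive_kib = 6652      # sample pack size when always attempting a delta
limited_kib = 6842    # sample pack size with the 50-byte / half-size limits
blob_pack_mb = 930    # current size of the full blob pack

growth_kib = limited_kib - naive_kib         # extra KiB in the sample
growth_ratio = growth_kib / naive_kib        # relative growth, a bit under 3%
projected_mb = blob_pack_mb * growth_ratio   # projected growth of the blob pack
```

The sample grows by 190 KiB, a little under 3%, which projects to roughly
26 to 27 MB on a 930 MB blob pack.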

-- 
Shawn.


* Re: Packfile can't be mapped
  2006-08-28 15:44           ` Jon Smirl
@ 2006-08-28 16:43             ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2006-08-28 16:43 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn Pearce, Linus Torvalds, git

On Mon, 28 Aug 2006, Jon Smirl wrote:

> On 8/28/06, Nicolas Pitre <nico@cam.org> wrote:
> > > Of course the major problem with pack-local, stream global
> > > dictionaries is it voids the ability to reuse that zlib'd content
> > > from that pack in another pack without wholesale copying the
> > > dictionary as well.  This is an issue for servers which want to
> > > copy out the pack entry without recompressing it but also want the
> > > storage savings from the global dictionaries.
> >
> > Why would copying the dictionary as well be a problem?  How large might
> > it be?  Can it be stored deflated itself?
> 
> The dictionaries are only 4KB. But they have to match up with the
> objects compressed using them. If you bring an object straight down
> out of a dictionary based pack and make it standalone you won't be
> able to read it. You need the dictionary to decode it. What if the
> local and remote packs have been packed using two different
> dictionaries? You can't directly move objects between them.

I guess we'll be able to reuse the same dictionary when objects are all 
from the same pack.

If not then they could be recompressed which is costly but never as much 
as delta matching initially was when it was always done.

Anyway let's look at the size saving first.  The implied costs could 
then be evaluated and weighted accordingly.


Nicolas


* Re: Packfile can't be mapped
  2006-08-28 14:40         ` Nicolas Pitre
  2006-08-28 15:44           ` Jon Smirl
@ 2006-08-28 16:48           ` Shawn Pearce
  1 sibling, 0 replies; 16+ messages in thread
From: Shawn Pearce @ 2006-08-28 16:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, git, Jon Smirl

Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 28 Aug 2006, Shawn Pearce wrote:
> > Of course the major problem with pack-local, stream global
> > dictionaries is it voids the ability to reuse that zlib'd content
> > from that pack in another pack without wholesale copying the
> > dictionary as well.  This is an issue for servers which want to
> > copy out the pack entry without recompressing it but also want the
> > storage savings from the global dictionaries.
> 
> Why would copying the dictionary as well be a problem?  How large might 
> it be?  Can it be stored deflated itself?

Largest size is like 200 bytes smaller than the window size.  Times 3
as we would want to store 3 dictionaries, though maybe only 2 if the
blob dictionary proves to be worse than not having one at all for a
given project.  Since it's just a binary buffer holding bytes which
frequently appear in our compressed objects it's easily deflatable;
especially when you consider its primarily storing US-ASCII text.

I was definitely planning on the dictionary being deflated in the
pack.

We could always use the SHA1 of a dictionary to signal to the client
what dictionary was in use.  If the dictionary itself was treated
like any other SHA1 object then we might be able to transfer the
server's current dictionaries to the client along with everything
else in the same pack if the client doesn't have the necessary
dictionary yet.
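
A rough sketch of that idea, assuming the dictionary is hashed the same way
a blob is (SHA1 over a "blob <size>\0" header followed by the content).
Treating a dictionary as a blob is only one possible choice and nothing here
reflects an existing protocol; both function names are hypothetical.

```python
import hashlib


def dictionary_id(dictionary):
    """SHA1 of the dictionary, hashed like a blob object (an assumption)."""
    header = b"blob %d\0" % len(dictionary)
    return hashlib.sha1(header + dictionary).hexdigest()


def dictionaries_to_send(server_dicts, client_known_ids):
    """Map id -> dictionary for every dictionary the client does not have yet."""
    missing = {}
    for d in server_dicts:
        dict_id = dictionary_id(d)
        if dict_id not in client_known_ids:
            missing[dict_id] = d
    return missing
```

The server would then include only the missing dictionaries alongside the
objects that were deflated against them.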

-- 
Shawn.


* Re: Packfile can't be mapped
  2006-08-28 16:42     ` Shawn Pearce
@ 2006-08-28 17:19       ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2006-08-28 17:19 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jon Smirl

On Mon, 28 Aug 2006, Shawn Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> > On Sun, 27 Aug 2006, Shawn Pearce wrote:
> > 
> > > I'm going to try to get tree deltas written to the pack sometime this
> > > week. That should compact this intermediate pack down to something
> > > that git-pack-objects would be able to successfully mmap into a
> > > 32 bit address space.  A complete repack with no delta reuse will
> > > hopefully generate a pack closer to 400 MB in size.  But I know
> > > Jon would like to get that pack even smaller.  :)
> > 
> > One thing to consider in your code (if you didn't implement that 
> > already) is to _not_ attempt any delta on any object whose size is 
> > smaller than 50 bytes, and then limit the maximum delta size to 
> > object_size/2 - 20 (use that for the last argument to diff-delta() and 
> > store the undeltified object when diff-delta returns NULL).  This way 
> > you'll avoid creating delta objects that are most likely to end up being 
> > _larger_ than the undeltified object.
> 
> So I added Nico's suggestions to fast-import and ran it on a small
> subset of the Mozilla repository (3424 blobs):
> 
>   naive always delta: 6652 KiB
>   Nico's suggestion:  6842 KiB

Hmmm...

> So Nico's suggestion of limiting delta size to (orig_len/2)-20 or
> not using deltas on blobs < 50 bytes actually added 190 KB to the
> output pack.  Since this sample is probably fairly representative
> of the rest of the repository's blobs I'm thinking we may see a 2.8%
> increase in size over the current 930 MB blob pack.  That's another
> 26 MB in our intermediate pack.  I don't think this suggestion is
> really worth including in fast-import right now...

The above is based on the assumption that undeltified blobs usually 
deflate to 50% of the undeflated size or more, and that pure object data 
deflates better than delta data.  Then there is the 20 byte base object 
reference overhead for any deltas.  The 20 bytes is a hard fact.  The 
50% factor is a wild guess.  What I forgot to consider in the above 
formula is the fact that delta data gets deflated as well so the /2 
divisor is probably a bit too much (you could try orig_len*2/3 - 20, or 
orig_len - 20, and adjust the initial threshold so the limit value 
doesn't go negative).

If you are IO bound (I recall Jon mentioning something to that effect) 
then you could probably use some CPU cycles to always deflate the 
object, deflate the resulting delta, and pick the smallest between the 
two (don't forget the additional 20 bytes in the delta case).  Maybe the 
increased CPU usage won't justify this solution though.
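
The "deflate both and keep the smaller" approach could be sketched like
this.  The delta is taken as an input (producing it is out of scope here),
and the function name and return shape are hypothetical.

```python
import zlib

BASE_REF_OVERHEAD = 20  # bytes for the base object reference stored with a delta


def pick_smallest(raw_object, delta):
    """Return ("delta" | "full", deflated payload), whichever costs fewer bytes.

    `delta` is the undeflated delta against some base, or None if there is none.
    """
    full = zlib.compress(raw_object)
    if delta is None:
        return "full", full
    deflated_delta = zlib.compress(delta)
    # The delta only wins if it is still smaller after paying for the base ref.
    if len(deflated_delta) + BASE_REF_OVERHEAD < len(full):
        return "delta", deflated_delta
    return "full", full
```

This spends two deflate passes per object instead of one, which is the
extra CPU cost mentioned above.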

Nicolas


* Re: Packfile can't be mapped
  2006-08-28  2:47 ` Shawn Pearce
  2006-08-28  4:27   ` Nicolas Pitre
@ 2006-08-29  4:52   ` Shawn Pearce
  2006-08-29  5:33     ` Shawn Pearce
  1 sibling, 1 reply; 16+ messages in thread
From: Shawn Pearce @ 2006-08-29  4:52 UTC (permalink / raw)
  To: git; +Cc: Jon Smirl

Shawn Pearce <spearce@spearce.org> wrote:
> I'm going to try to get tree deltas written to the pack sometime this
> week.

I was able to implement and with Jon Smirl's help debug the tree
delta code in fast-import.
 
Earlier this evening Jon sent me the following:
> git-fast-import statistics:
> ---------------------------------------------------------------------
> Alloc'd objects:    1980000 (         0 overflow  )
> Total objects:      1967527 (     41856 duplicates                  )
>       blobs  :       633842 (         0 duplicates     576219 deltas)
>       trees  :      1131208 (     41856 duplicates    1019741 deltas)
>       commits:       200921 (         0 duplicates          0 deltas)
>       tags   :         1556 (         0 duplicates          0 deltas)
> Total branches:        1600 (      2228 loads     )
>       marks:        1048576 (    200921 unique    )
>       atoms:          56803
> Memory total:         75213 KiB
>        pools:         13338 KiB
>      objects:         61875 KiB
> Pack remaps:            658
> Pack size:           895983 KiB
> Index size:           46114 KiB
> ---------------------------------------------------------------------

Compared to our last attempt:
> > Pack size:          1713200 KiB
> > Index size:           46114 KiB

This tree delta version came out pretty good.  The pack with tree
deltas is 874 MiB.  Quite a reduction in size.  fast-import takes
about 20 minutes to convert its 20 GiB input file into this 874 MiB
pack.  Producing the 20 GiB input file from the 3 GiB CVS ,v
files takes about 4 hours with Jon's modified cvs2svn.

Jon has started a `git-repack -a -f` with aggressive depth and
window sizes.  He estimated it may need another 2.5 hours to process.
Hopefully I'll hear more details tomorrow.

-- 
Shawn.


* Re: Packfile can't be mapped
  2006-08-29  4:52   ` Shawn Pearce
@ 2006-08-29  5:33     ` Shawn Pearce
  0 siblings, 0 replies; 16+ messages in thread
From: Shawn Pearce @ 2006-08-29  5:33 UTC (permalink / raw)
  To: git; +Cc: Jon Smirl

Shawn Pearce <spearce@spearce.org> wrote:
> This tree delta version came out pretty good.  The pack with tree
> deltas is 874 MiB.  Quite a reduction in size.  fast-import takes
> about 20 minutes to convert its 20 GiB input file into this 874 MiB
> pack.  Producing the 20 GiB input file from the 3 GiB CVS ,v
> files takes about 4 hours with Jon's modified cvs2svn.
> 
> Jon has started a `git-repack -a -f` with aggressive depth and
> window sizes.  He estimated it may need another 2.5 hours to process.
> Hopefully I'll hear more details tomorrow.

I just heard from Jon:
> git-repack -a -f --window=50 --depth=5000
> 100% CPU for 60 minutes
> 1.2GB resident memory
> Final pack size is 451,203,363 bytes.

So with very aggressive delta depth and window sizes git-repack took
a while to run but came very close to the best packed size from
previous Mozilla CVS import attempts.  I think we'd still like to
make the final historical pack smaller than that.

-- 
Shawn.


end of thread, other threads:[~2006-08-29  5:33 UTC | newest]

Thread overview: 16+ messages
2006-08-28  1:04 Packfile can't be mapped Jon Smirl
2006-08-28  2:47 ` Shawn Pearce
2006-08-28  4:27   ` Nicolas Pitre
2006-08-28  4:36     ` Linus Torvalds
2006-08-28  6:00       ` Shawn Pearce
2006-08-28 14:15         ` Jon Smirl
2006-08-28 14:40         ` Nicolas Pitre
2006-08-28 15:44           ` Jon Smirl
2006-08-28 16:43             ` Nicolas Pitre
2006-08-28 16:48           ` Shawn Pearce
2006-08-28 14:48       ` Nicolas Pitre
2006-08-28  5:33     ` Shawn Pearce
2006-08-28 16:42     ` Shawn Pearce
2006-08-28 17:19       ` Nicolas Pitre
2006-08-29  4:52   ` Shawn Pearce
2006-08-29  5:33     ` Shawn Pearce
