git.vger.kernel.org archive mirror
* Decompression speed: zip vs lzo
@ 2008-01-09 22:01 Marco Costalba
  2008-01-09 22:55 ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Marco Costalba @ 2008-01-09 22:01 UTC (permalink / raw)
  To: Git Mailing List

I have created a big tar from the linux tree:

linux-2.6.tar   300,0 MB

Then I created two compressed files, with the zip and lzop utilities (the
latter uses the LZO compression algorithm):

linux-2.6.zip  70,1 MB

linux-2.6.tar.lzo  108,0 MB

Then I tested the decompression speed:

$ time unzip -p linux-2.6.zip > /dev/null
3.95user 0.09system 0:04.05elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps

$ time lzop -d -c linux-2.6.tar.lzo > /dev/null
2.10user 0.07system 0:02.18elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+234minor)pagefaults 0swaps


So the bottom line is that lzo decompression is almost twice as fast as zip.


Marco

P.S: Compressed size is better for zip, but a more realistic test
would be to try a delta-packed repo instead of a simple tar of
source files. Because delta-packed data is already compressed in its
own way, perhaps the difference in final file sizes would be smaller.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-09 22:01 Decompression speed: zip vs lzo Marco Costalba
@ 2008-01-09 22:55 ` Junio C Hamano
  2008-01-09 23:23   ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2008-01-09 22:55 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Git Mailing List

"Marco Costalba" <mcostalba@gmail.com> writes:

> P.S: Compressed size is better for zip, but a more realistic test
> would be to try a delta-packed repo instead of a simple tar of
> source files. Because delta-packed data is already compressed in its
> own way, perhaps the difference in final file sizes would be smaller.

Note that neither the space nor the time performance of compressing
and uncompressing a single huge blob is as interesting in the
context of git as compressing/uncompressing millions of small
pieces whose total size is comparable to that of the "huge
single blob" experiment.  Loose object files are obviously
compressed individually, and packfile contents are also
individually and independently compressed.  The set-up cost of
each individual invocation of compression and uncompression on
smaller data matters a lot more than an experiment on
compressing and uncompressing a single huge blob would suggest
(and this applies to both time and space).
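
As a rough standalone illustration of that set-up cost (a sketch for this
discussion, not git code; the 270-byte piece size, the helper names and the
filler data are just assumptions), the following program deflates the same
bytes once as a single buffer and once as many independently compressed
pieces, so the only difference is the per-invocation zlib set-up and
tear-down:

/* sketch: same total bytes, one stream vs. many tiny independent streams
 * build with something like: cc -O2 small-objects.c -lz
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

static double now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	const size_t piece = 270;              /* assumed "typical small object" size */
	const size_t count = 100000;
	const size_t total = piece * count;
	unsigned char *src = malloc(total);
	unsigned char *dst = malloc(compressBound(total));
	unsigned long out_total = 0;
	uLongf len;
	size_t i;
	double t;

	if (!src || !dst)
		return 1;
	for (i = 0; i < total; i++)
		src[i] = (unsigned char)(i % 251); /* mildly compressible filler */

	/* one big buffer: a single deflate stream, one set-up/tear-down */
	t = now();
	len = compressBound(total);
	if (compress2(dst, &len, src, total, Z_DEFAULT_COMPRESSION) != Z_OK)
		return 1;
	printf("1 x %zu bytes:    %.3fs, %lu bytes out\n",
	       total, now() - t, (unsigned long)len);

	/* many small buffers: one full deflate set-up/tear-down per piece */
	t = now();
	for (i = 0; i < count; i++) {
		len = compressBound(piece);
		if (compress2(dst, &len, src + i * piece, piece,
			      Z_DEFAULT_COMPRESSION) != Z_OK)
			return 1;
		out_total += len;
	}
	printf("%zu x %zu bytes: %.3fs, %lu bytes out\n",
	       count, piece, now() - t, out_total);

	free(src);
	free(dst);
	return 0;
}

The second run is expected to be dominated by the per-call overhead rather
than by the bytes actually compressed, which is the time side of the point;
each small output also carries its own stream header and trailer, which is
the space side.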

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-09 22:55 ` Junio C Hamano
@ 2008-01-09 23:23   ` Sam Vilain
  2008-01-09 23:31     ` Johannes Schindelin
  2008-01-09 23:49     ` Junio C Hamano
  0 siblings, 2 replies; 39+ messages in thread
From: Sam Vilain @ 2008-01-09 23:23 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Git Mailing List, Junio C Hamano

Junio C Hamano wrote:
> Note that neither the space nor the time performance of compressing
> and uncompressing a single huge blob is as interesting in the
> context of git as compressing/uncompressing millions of small
> pieces whose total size is comparable to that of the "huge
> single blob" experiment.  Loose object files are obviously
> compressed individually, and packfile contents are also
> individually and independently compressed.  The set-up cost of
> each individual invocation of compression and uncompression on
> smaller data matters a lot more than an experiment on
> compressing and uncompressing a single huge blob would suggest
> (and this applies to both time and space).

Yes - and lzo will almost certainly win on all those counts!

I think to go forward this would need a prototype and benchmark figures
for things like "annotate" and "fsck --full" - but bear in mind it would
be a long road to follow through to completion, as repository compatibility
would need to be a primary concern and this essentially would create a
new pack type AND a new *object* type.  Not only that, but currently
there is no header in the objects on disk which can be used to detect a
gzip vs. an lzop stream.  Not really worth it IMHO - gzip is already
fast enough on even the most modern processor these days.

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-09 23:23   ` Sam Vilain
@ 2008-01-09 23:31     ` Johannes Schindelin
  2008-01-10  1:02       ` Sam Vilain
  2008-01-10  3:41       ` Nicolas Pitre
  2008-01-09 23:49     ` Junio C Hamano
  1 sibling, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2008-01-09 23:31 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Marco Costalba, Git Mailing List, Junio C Hamano

Hi,

On Thu, 10 Jan 2008, Sam Vilain wrote:

> I think to go forward this would need a prototype and benchmark figures 
> for things like "annotate" and "fsck --full" - but bear in mind it would 
> be a long road to follow through to completion, as repository compatibility 
> would need to be a primary concern and this essentially would create a 
> new pack type AND a new *object* type.

No new object type.  Why should it?  But it has to have a config variable 
which says what type of packs/loose objects it has (and you will not be 
able to mix them).

> Not really worth it IMHO - gzip is already fast enough on even the most 
> modern processor these days.

I agree that gzip is already fast enough.

However, pack v4 had more goodies than just being faster; it also promised 
to have smaller packs.  And pack v4 would need to have the same 
infrastructure of repacking if the client does not understand v4 packs.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-09 23:23   ` Sam Vilain
  2008-01-09 23:31     ` Johannes Schindelin
@ 2008-01-09 23:49     ` Junio C Hamano
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2008-01-09 23:49 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Marco Costalba, Git Mailing List

Sam Vilain <sam@vilain.net> writes:

> I think to go forward this would need a prototype and benchmark figures
> for things like "annotate" and "fsck --full" - but bear in mind it would
> be a long road to follow through to completion, as repository compatibility
> would need to be a primary concern and this essentially would create a
> new pack type AND a new *object* type.  Not only that, but currently
> there is no header in the objects on disk which can be used to detect a
> gzip vs. an lzop stream.  Not really worth it IMHO - gzip is already
> fast enough on even the most modern processor these days.

For the compression type detection, I was hoping that we could
do something like sha1_file.c::legacy_loose_object(), but I tend
to agree it is probably not worth it.
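
For illustration only, here is a rough sketch of that kind of sniffing (a
hypothetical helper, not the actual legacy_loose_object() code; the names
obj_encoding and sniff_encoding are made up).  Per RFC 1950 a zlib stream
starts with a CMF/FLG byte pair whose compression method nibble is 8 and
whose big-endian 16-bit value is a multiple of 31, so anything failing that
test could be routed to an alternative decompressor:

enum obj_encoding { ENC_ZLIB, ENC_OTHER };

static enum obj_encoding sniff_encoding(const unsigned char *buf, unsigned long len)
{
	if (len < 2)
		return ENC_OTHER;
	if ((buf[0] & 0x0f) == 0x08 &&            /* compression method: deflate */
	    ((buf[0] << 8 | buf[1]) % 31) == 0)   /* FCHECK: CMF<<8|FLG divisible by 31 */
		return ENC_ZLIB;
	return ENC_OTHER;                         /* candidate for e.g. an lzo path */
}

Note the check is probabilistic (roughly one random two-byte prefix in ~500
would also pass), which is one more reason an explicit pack version or config
flag is more robust than sniffing.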

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-09 23:31     ` Johannes Schindelin
@ 2008-01-10  1:02       ` Sam Vilain
  2008-01-10  5:02         ` Sam Vilain
  2008-01-10  3:41       ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2008-01-10  1:02 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailing List

Johannes Schindelin wrote:
> No new object type.  Why should it?  But it has to have a config variable 
> which says what type of packs/loose objects it has (and you will not be 
> able to mix them).

I meant a new loose object type.  However this is configured, it affects things
like HTTP push/pull.  Configuring it like that would be a bit too fragile
for my taste.

>> Not really worth it IMHO - gzip is already fast enough on even the most 
>> modern processor these days.
> 
> I agree that gzip is already fast enough.
> 
> However, pack v4 had more goodies than just being faster; it also promised 
> to have smaller packs.  And pack v4 would need to have the same 
> infrastructure of repacking if the client does not understand v4 packs.

Indeed - I think it would be a lot easier to implement if it didn't
bother with loose objects.  It can just be a new pack version with more
compression formats.  When you know you're going to be doing a lot
of analysis you'd already run "git-repack -a -f" to shorten the deltas,
so this might be a useful option for some - but again I'd want to see
figures first.

I do really like LZOP as far as compression algorithms go.  It seems a
lot faster for not a huge loss in ratio.

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-09 23:31     ` Johannes Schindelin
  2008-01-10  1:02       ` Sam Vilain
@ 2008-01-10  3:41       ` Nicolas Pitre
  2008-01-10  6:55         ` Marco Costalba
  1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-10  3:41 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Marco Costalba, Git Mailing List, Junio C Hamano

On Wed, 9 Jan 2008, Johannes Schindelin wrote:

> I agree that gzip is already fast enough.
> 
> However, pack v4 had more goodies than just being faster; it also promised 
> to have smaller packs.

Right, like not having to compress tree objects and half of commit 
objects at all.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10  1:02       ` Sam Vilain
@ 2008-01-10  5:02         ` Sam Vilain
  2008-01-10  9:16           ` Pierre Habouzit
  0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2008-01-10  5:02 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Johannes Schindelin, Marco Costalba, Junio C Hamano

Sam Vilain wrote:
> I do really like LZOP as far as compression algorithms go.  It seems a
> lot faster for not a huge loss in ratio.

Coincidentally, I read this today on an algorithm (LZMA - same as 7zip)
which is very slow to compress, high ratio but quick decompression:

  http://use.perl.org/~acme/journal/35330

Which sounds excellent for squeezing those "archive packs" into even
more ridiculously tiny spaces.

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10  3:41       ` Nicolas Pitre
@ 2008-01-10  6:55         ` Marco Costalba
  2008-01-10 11:45           ` Marco Costalba
  2008-01-10 19:34           ` Dana How
  0 siblings, 2 replies; 39+ messages in thread
From: Marco Costalba @ 2008-01-10  6:55 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Sam Vilain, Git Mailing List, Junio C Hamano

On Jan 10, 2008 4:41 AM, Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 9 Jan 2008, Johannes Schindelin wrote:
>
> > I agree that gzip is already fast enough.
> >
> > However, pack v4 had more goodies than just being faster; it also promised
> > to have smaller packs.
>
> Right, like not having to compress tree objects and half of commit
> objects at all.
>
>

Decompression speed has been shown to be a bottleneck in some tests
involving mainly 'git log'.

Regarding backward compatibility, I really don't know at what level git
functions actually need to know the compression format. Looking at the
code I would say at a very low level: the functions that deal directly
with inflate() and friends are few [1] and not directly connected to the
UI, nor to the git config. Is the compression format something the user
should know or care about? And if yes, why?

In my tests the assumption of a tarball of source files is unrealistic;
to test the final size difference I would like to try different
compressions on a big, already-packed but not yet compressed file.
Could someone be so kind as to hint me on how to create such a pack
with good quality, i.e. with packing levels similar to what is done
for public repos?

This does not realistically test speed because, as Junio pointed out,
the real decompression scheme is different: many calls on small
objects, not one call on a big one. But if the final size is acceptable
we can go on to more difficult tests.

Marco

[1] where inflate() is called:

-inflate_it() in builtin-apply.c
-check_pack_inflate() in builtin-pack-objects.c
-get_data() in builtin-unpack-objects.c
-fwrite_sha1_file() in http-push.c and http-walker.c  [mmm, interesting:
the same function in two files, and the signature and the contents seem
the same....]
-unpack_entry_data() in index-pack.c
-unpack_sha1_header(), unpack_sha1_rest(), get_size_from_delta(),
unpack_compressed_entry(), write_sha1_from_fd() in sha1_file.c

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10  5:02         ` Sam Vilain
@ 2008-01-10  9:16           ` Pierre Habouzit
  2008-01-10 20:39             ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Pierre Habouzit @ 2008-01-10  9:16 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Git Mailing List, Johannes Schindelin, Marco Costalba,
	Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 2477 bytes --]

On Thu, Jan 10, 2008 at 07:02:39AM +0000, Sam Vilain wrote:
> Sam Vilain wrote:
> > I do really like LZOP as far as compression algorithms go.  It seems a
> > lot faster for not a huge loss in ratio.
> 
> Coincidentally, I read this today on an algorithm (LZMA - same as 7zip)
> which is very slow to compress, high ratio but quick decompression:
>
>   http://use.perl.org/~acme/journal/35330
>
> Which sounds excellent for squeezing those "archive packs" into even
> more ridiculously tiny spaces.

Well, lzma is excellent for *big* chunks of data, but not that impressive for
small files:

$ ll git.c git.c.gz git.c.lzma git.c.lzop
-rw-r--r-- 1 madcoder madcoder 12915 2008-01-09 13:47 git.c
-rw-r--r-- 1 madcoder madcoder  4225 2008-01-10 10:00 git.c.gz
-rw-r--r-- 1 madcoder madcoder  4094 2008-01-10 10:00 git.c.lzma
-rw-r--r-- 1 madcoder madcoder  5068 2008-01-10 09:59 git.c.lzop


And lzma performs really badly if you have little memory available. The "big" secret
of lzma is that it basically works with a huge window to find repetitive
data, and even decompression needs quite a fair amount of memory, making it a
really bad choice for git IMNSHO.

Though I don't agree with you (and some others) that gzip is
fast enough. It's clearly a bottleneck in many log-related commands that you
would expect to be IO bound rather than CPU bound.  LZO seems like a fairer
choice, especially since where it gains is basically in the compression of the
biggest blobs, aka the delta chain heads. It's really unclear to me whether we
really gain by compressing the deltas, trees, and other smallish information.

And when it comes to times, for a file big enough to give meaningful numbers,
here are the decompression times (best of 10 runs, smaller is better; the second
number is the size of the packed data, the original data was 7.8MB):
  * lzma: 0.374s (2.2MB)
  * gzip: 0.127s (2.9MB)
  * lzop: 0.053s (3.2MB)

For a 300k original file:
  * lzma: 0.022s (124KB)
  * gzip: 0.008s (144KB)
  * lzop: 0.004s (156KB) /* most of the samples were actually 0.005 */

What is obvious to me is that lzop seems to take 10% more space than gzip,
while being around 1.5 to 2 times faster. Of course this is very sketchy and a
real test with git will be better.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10  6:55         ` Marco Costalba
@ 2008-01-10 11:45           ` Marco Costalba
  2008-01-10 12:12             ` Johannes Schindelin
  2008-01-10 19:34           ` Dana How
  1 sibling, 1 reply; 39+ messages in thread
From: Marco Costalba @ 2008-01-10 11:45 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Sam Vilain, Git Mailing List, Junio C Hamano

On Jan 10, 2008 7:55 AM, Marco Costalba <mcostalba@gmail.com> wrote:
>
> [1] where inflate() is called:
>
> -inflate_it() in builtin-apply.c
> -check_pack_inflate() in builtin-pack-objects.c
> -get_data() in builtin-unpack-objects.c
> -fwrite_sha1_file() in http-push.c and http-walker.c  [mmm, interesting:
> the same function in two files, and the signature and the contents seem
> the same....]
> -unpack_entry_data() in index-pack.c
> -unpack_sha1_header(), unpack_sha1_rest(), get_size_from_delta(),
> unpack_compressed_entry(), write_sha1_from_fd() in sha1_file.c
>

Looking at the git sources I have found that the zlib routines are
candidates for a cleanup; for example, more or less the same
lines of code are repeated many times across the git files:

memset(&stream, 0, sizeof(stream));
deflateInit(&stream, pack_compression_level);
maxsize = deflateBound(&stream, size);
out = xmalloc(maxsize);
stream.next_out = out;
stream.avail_out = maxsize;


So what I'm planning to do, in order to test different algorithms, is first
a cleanup that goes more or less as follows:

- Remove #include <zlib.h> from cache.h and substitute with #include
"compress.h"

- Add #include <zlib.h> where it is "really" intended, for example in archive-zip.c

- Rename inflate()/deflate() and the other zlib calls to corresponding
  zlib_inflate()
  zlib_deflate()

wrappers, declared in compress.h

- Define zlib_inflate() and friends as simple wrappers around the
corresponding zlib functions

- Test that everything is ok (it should be only code shuffling/renaming up to this point)

- Start cleaning up, for example adding a do_deflateInit() that wraps
all the code I have reported above involving deflateInit()

- When the compression routines are cleaned up, add new functions

do_inflate() and do_deflate(), in place of the zlib_* ones, that wrap the
compression algorithm dispatching logic.

Dispatching could be chosen in different ways, ranging from (a rough
sketch of this dispatch layer follows the list):

- compile time (at the #define level)
- config (some configuration value stored in a global variable)
- dynamic (at run time, with no configuration needed; I have some
ideas on this ;-)
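
To make the dispatching idea concrete, here is a hypothetical sketch of what
such a compress.h layer could look like.  None of these names (struct
compressor, do_deflate, do_inflate, the zlib_*_buf helpers) are real git
identifiers, and for brevity it uses zlib's one-shot compress2()/uncompress()
helpers instead of the streaming z_stream interface the real call sites use:

#include <zlib.h>

struct compressor {
	const char *name;
	unsigned long (*bound)(unsigned long srclen);
	int (*deflate_buf)(void *dst, unsigned long *dstlen,
	                   const void *src, unsigned long srclen, int level);
	int (*inflate_buf)(void *dst, unsigned long *dstlen,
	                   const void *src, unsigned long srclen);
};

static unsigned long zlib_bound(unsigned long srclen)
{
	return compressBound(srclen);
}

static int zlib_deflate_buf(void *dst, unsigned long *dstlen,
                            const void *src, unsigned long srclen, int level)
{
	uLongf n = *dstlen;
	int ret = compress2(dst, &n, src, srclen, level);

	*dstlen = n;
	return ret == Z_OK ? 0 : -1;
}

static int zlib_inflate_buf(void *dst, unsigned long *dstlen,
                            const void *src, unsigned long srclen)
{
	uLongf n = *dstlen;
	int ret = uncompress(dst, &n, src, srclen);

	*dstlen = n;
	return ret == Z_OK ? 0 : -1;
}

static const struct compressor zlib_compressor = {
	"zlib", zlib_bound, zlib_deflate_buf, zlib_inflate_buf,
};

/* chosen by #define, config or run-time detection; an lzo backend would slot in here */
static const struct compressor *active = &zlib_compressor;

int do_deflate(void *dst, unsigned long *dstlen,
               const void *src, unsigned long srclen, int level)
{
	return active->deflate_buf(dst, dstlen, src, srclen, level);
}

int do_inflate(void *dst, unsigned long *dstlen,
               const void *src, unsigned long srclen)
{
	return active->inflate_buf(dst, dstlen, src, srclen);
}

An LZO (or NULL) backend would then only have to fill in another struct
compressor, and the dispatch choice above reduces to deciding what the
`active' pointer is set to.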


Comments?

Thanks
Marco

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 11:45           ` Marco Costalba
@ 2008-01-10 12:12             ` Johannes Schindelin
  2008-01-10 12:18               ` Marco Costalba
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2008-01-10 12:12 UTC (permalink / raw)
  To: Marco Costalba
  Cc: Nicolas Pitre, Sam Vilain, Git Mailing List, Junio C Hamano

Hi,

On Thu, 10 Jan 2008, Marco Costalba wrote:

> - Remove #include <zlib.h> from cache.h and substitute with #include
> "compress.h"

No.  We will always need zlib for compatibility.  You cannot just replace 
zlib usage in git.

> - Add #include <zlib.h> where it is "really" intended, for example in 
> archive-zip.c

We have a long tradition of having the system includes in cache.h.

Besides, if you have "compress.h" included in cache.h, which in turn has 
to include "zlib.h", what is the use of putting it also in archive-zip.c?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 12:12             ` Johannes Schindelin
@ 2008-01-10 12:18               ` Marco Costalba
  0 siblings, 0 replies; 39+ messages in thread
From: Marco Costalba @ 2008-01-10 12:18 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Sam Vilain, Git Mailing List, Junio C Hamano

On Jan 10, 2008 1:12 PM, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Thu, 10 Jan 2008, Marco Costalba wrote:
>
> > - Remove #include <zlib.h> from cache.h and substitute with #include
> > "compress.h"
>
> No.  We will always need zlib for compatibility.  You cannot just replace
> zlib usage in git.
>

Ok. This was just to check what breaks by removing zlib.h, so that
I'm sure to rename all the zlib-related stuff.

But I agree this is mostly a development detail and I can do it just
in my private tree to help me hack on the patches.

Thanks
Marco

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10  6:55         ` Marco Costalba
  2008-01-10 11:45           ` Marco Costalba
@ 2008-01-10 19:34           ` Dana How
  1 sibling, 0 replies; 39+ messages in thread
From: Dana How @ 2008-01-10 19:34 UTC (permalink / raw)
  To: Marco Costalba
  Cc: Nicolas Pitre, Johannes Schindelin, Sam Vilain, Git Mailing List,
	Junio C Hamano, danahow

On Jan 9, 2008 10:55 PM, Marco Costalba <mcostalba@gmail.com> wrote:
> On Jan 10, 2008 4:41 AM, Nicolas Pitre <nico@cam.org> wrote:
> > On Wed, 9 Jan 2008, Johannes Schindelin wrote:
> >
> > > I agree that gzip is already fast enough.
> > >
> > > However, pack v4 had more goodies than just being faster; it also promised
> > > to have smaller packs.
> >
> > Right, like not having to compress tree objects and half of commit
> > objects at all.
>
> Decompression speed has been shown to be a bottleneck in some tests
> involving mainly 'git log'.

Thanks for looking into this,  in this email and your subsequent ones.

I agree that zip time is an issue.  I was looking into reducing the _number_
of zip calls on the same data,  but work and personal crises have reduced
me from an infrequent contributor to an occasional gadfly for the moment.

> Regarding backward compatibility, I really don't know at what level git
> functions actually need to know the compression format. Looking at the
> code I would say at a very low level: the functions that deal directly
> with inflate() and friends are few [1] and not directly connected to the
> UI, nor to the git config. Is the compression format something the user
> should know or care about? And if yes, why?
>
> In my tests the assumption of a tarball of source files is unrealistic;
> to test the final size difference I would like to try different
> compressions on a big, already-packed but not yet compressed file.
> Could someone be so kind as to hint me on how to create such a pack
> with good quality, i.e. with packing levels similar to what is done
> for public repos?
>
> This does not realistically test speed because, as Junio pointed out,
> the real decompression scheme is different: many calls on small
> objects, not one call on a big one. But if the final size is acceptable
> we can go on to more difficult tests.

The approach you're taking (here and in following emails) of being
able to make zip/lzo selection and measure the results should be
enlightening.  For the vast majority of git users,  Junio's scenario
is the most relevant.

Of additional interest to me is handling enormous objects more quickly.
I would like to replace some p4 usage here with git,  but most users will
only notice the speed difference and not use git's extra features.  Thus
they will compare git add/git commit/git push unfavorably to p4 edit/p4 submit,
because the former effectively does zip/unzip/zip/send,  while the latter
only does zip/send (git's extra "unzip/zip" comes from loose objects not
being directly copyable into packs).  This speed difference is irrelevant
for small to normal files,  but a killer when committing a collection of say
100MB files.

Your lzo option could reduce this performance degradation vs p4 from
3x to close to 1.5x.  If you get it accepted,  I'd love to then "fix" the loose
object copying "problem" making git _faster_ than p4 on large files!
2 simple forms for this "fix" would be to use the once-and-future "new"
loose object format (an idea already rejected),  or to encode all loose
objects as singleton packs under .git/objects/xx (so that all (re)packing,
in the absence of new deltification,  becomes pack-to-pack copying).
This latter idea is a modification of an idea from Nicolas Pitre.
It certainly adds less code than other approaches for such a "fix".

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10  9:16           ` Pierre Habouzit
@ 2008-01-10 20:39             ` Nicolas Pitre
  2008-01-10 21:01               ` Linus Torvalds
                                 ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-10 20:39 UTC (permalink / raw)
  To: Pierre Habouzit
  Cc: Sam Vilain, Git Mailing List, Johannes Schindelin, Marco Costalba,
	Junio C Hamano

On Thu, 10 Jan 2008, Pierre Habouzit wrote:

> Well, lzma is excellent for *big* chunks of data, but not that impressive for
> small files:
> 
> $ ll git.c git.c.gz git.c.lzma git.c.lzop
> -rw-r--r-- 1 madcoder madcoder 12915 2008-01-09 13:47 git.c
> -rw-r--r-- 1 madcoder madcoder  4225 2008-01-10 10:00 git.c.gz
> -rw-r--r-- 1 madcoder madcoder  4094 2008-01-10 10:00 git.c.lzma
> -rw-r--r-- 1 madcoder madcoder  5068 2008-01-10 09:59 git.c.lzop

This is really the big point here.  Git uses _lots_ of *small* objects, 
usually much smaller than 12KB.  For example, my copy of the gcc 
repository has an average of 270 _bytes_ per compressed object, and 
objects must be individually compressed.

Performance with really small objects should be the basis for any 
Git compression algorithm comparison.

> Though I don't agree with you (and some others) that gzip is 
> fast enough. It's clearly a bottleneck in many log-related 
> commands that you would expect to be IO bound rather than CPU 
> bound.  LZO seems like a fairer choice, especially since where it 
> gains is basically in the compression of the biggest blobs, aka the 
> delta chain heads.

The delta heads, though, are far from being the most frequently accessed 
objects.  First, they're clearly in the minority, and they're often already 
in the delta base cache.

> It's really unclear to me whether we really gain by 
> compressing the deltas, trees, and other smallish information.

Remember that delta objects represent the vast majority of all objects. 
For example, my kernel repo currently has 555015 delta objects out of 
677073 objects, or 82% of the total.  There are actually only 25869 
non-deltified blob objects, which are likely to be the larger objects, but 
they represent only 4% of the total.

But let's just try not compressing delta objects, to check your 
assertion, with the following hack:

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index a39cb82..252b03e 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -433,7 +433,10 @@ static unsigned long write_object(struct sha1file *f,
 		}
 		/* compress the data to store and put compressed length in datalen */
 		memset(&stream, 0, sizeof(stream));
-		deflateInit(&stream, pack_compression_level);
+		if (obj_type == OBJ_REF_DELTA || obj_type == OBJ_OFS_DELTA)
+			deflateInit(&stream, 0);
+		else
+			deflateInit(&stream, pack_compression_level);
 		maxsize = deflateBound(&stream, size);
 		out = xmalloc(maxsize);
 		/* Compress it */

You then only need to run 'git repack -a -f -d' with and without the 
above patch.

Here's my rather surprising results:

My kernel repo pack size without the patch:	184275401 bytes
Same repo with the above patch applied:		205204930 bytes

So it is only 11% larger.  I was expecting much more.

I'll let someone else do profiling/timing comparisons.

> What is obvious to me is that lzop seems to take 10% more space than gzip,
> while being around 1.5 to 2 times faster. Of course this is very sketchy and a
> real test with git will be better.

Right.  Abstracting the zlib code and having different compression 
algorithms tested in the Git context is the only way to do meaningful 
comparisons.


Nicolas

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 20:39             ` Nicolas Pitre
@ 2008-01-10 21:01               ` Linus Torvalds
  2008-01-10 21:30                 ` Nicolas Pitre
  2008-01-10 21:45                 ` Sam Vilain
  2008-01-10 21:51               ` Marco Costalba
                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2008-01-10 21:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Pierre Habouzit, Sam Vilain, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano



On Thu, 10 Jan 2008, Nicolas Pitre wrote:
> 
> Here's my rather surprising results:
> 
> My kernel repo pack size without the patch:	184275401 bytes
> Same repo with the above patch applied:		205204930 bytes
> 
> So it is only 11% larger.  I was expecting much more.

It's probably worth doing those statistics on some other projects.

The kernel has for the last five+ years very much encouraged people to 
make series of small changes, so I would not be surprised if it turns out 
that the deltas for the kernel are smaller than average, if only because 
the whole development process has encouraged people to send in a series of 
ten patches rather than a single larger one.

And there are basically *no* generated files in the kernel source repo.

Maybe the difference to other repositories isn't huge, and maybe the 
kernel *is* a good test-case, but I just wouldn't take that for granted. 

Yes, deltas are bound to compress much less well than non-deltas, and 
especially for tree objects (which is a large chunk of them) they probably 
compress even less (because a big part of the delta is actually just the 
SHA1 changes), but if it's 11% on the kernel, it could easily be 25% on 
something else.

Try with the gcc repo, especially the one that has deep delta chains (so 
it has even *more* deltas in relation to full objects than the kernel has)

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 21:01               ` Linus Torvalds
@ 2008-01-10 21:30                 ` Nicolas Pitre
  2008-01-11  8:57                   ` Pierre Habouzit
  2008-01-10 21:45                 ` Sam Vilain
  1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-10 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pierre Habouzit, Sam Vilain, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

On Thu, 10 Jan 2008, Linus Torvalds wrote:

> 
> 
> On Thu, 10 Jan 2008, Nicolas Pitre wrote:
> > 
> > Here's my rather surprising results:
> > 
> > My kernel repo pack size without the patch:	184275401 bytes
> > Same repo with the above patch applied:		205204930 bytes
> > 
> > So it is only 11% larger.  I was expecting much more.
> 
> It's probably worth doing those statistics on some other projects.
> 
> Maybe the difference to other repositories isn't huge, and maybe the 
> kernel *is* a good test-case, but I just wouldn't take that for granted. 

Obviously.

This was a really crude test, and my initial goal was to quickly dismiss 
Pierre's assertion.  Turns out that he wasn't that wrong after all, and 
if a significant increase in access speed by avoiding zlib for 82% of 
object accesses can also be demonstrated for the kernel, then we have an 
opportunity for some optimization tradeoff with no backward 
compatibility concerns.

> Yes, deltas are bound to compress much less well than non-deltas, and 
> especially for tree objects (which is a large chunk of them) they probably 
> compress even less (because a big part of the delta is actually just the 
> SHA1 changes), but if it's 11% on the kernel, it could easily be 25% on 
> something else.

Right.  But again this is not worth pursuing if a significant speed 
increase in repo access is not demonstrated at least with the kernel.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 21:01               ` Linus Torvalds
  2008-01-10 21:30                 ` Nicolas Pitre
@ 2008-01-10 21:45                 ` Sam Vilain
  2008-01-10 22:03                   ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2008-01-10 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Linus Torvalds wrote:
> Maybe the difference to other repositories isn't huge, and maybe the 
> kernel *is* a good test-case, but I just wouldn't take that for granted. 
>
> Try with the gcc repo, especially the one that has deep delta chains (so 
> it has even *more* deltas in relation to full objects than the kernel has)

For reference, 20 years of Perl with very deep deltas:

wilber:~/src/perl-preview$ du -sk .git
73274   .git
wilber:~/src/perl-preview$ git-repack -a
Counting objects: 244360, done.
Compressing objects: 100% (55493/55493), done.
Writing objects: 100% (244360/244360), done.
Total 244360 (delta 181061), reused 244360 (delta 181061)
wilber:~/src/perl-preview$ du -sk .git/objects/pack/
75389   .git/objects/pack/
wilber:~/src/perl-preview$

There are a few generated files in this history, but really only yacc
files etc.  In general it also consists of a lot of small changes.

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 20:39             ` Nicolas Pitre
  2008-01-10 21:01               ` Linus Torvalds
@ 2008-01-10 21:51               ` Marco Costalba
  2008-01-10 22:01                 ` Sam Vilain
  2008-01-10 22:18                 ` Nicolas Pitre
  2008-01-11  9:45               ` Pierre Habouzit
  2008-01-11 14:18               ` Morten Welinder
  3 siblings, 2 replies; 39+ messages in thread
From: Marco Costalba @ 2008-01-10 21:51 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Pierre Habouzit, Sam Vilain, Git Mailing List,
	Johannes Schindelin, Junio C Hamano

On Jan 10, 2008 9:39 PM, Nicolas Pitre <nico@cam.org> wrote:
>
> Right.  Abstracting the zlib code and having different compression
> algorithms tested in the Git context is the only way to do meaningful
> comparisons.
>

The first thing I would like to test, when the zlib abstraction is ready,
is a NULL compressor, i.e. no compression/decompression at
all, and see if 'git log' and friends are happy.

BTW, would it be possible to test git with zlib disabled even now? I mean,
is there a quick hack to disable zlib not only when writing but also when
reading, so that we can see what happens when running on a repository
packed without compression?


Thanks
Marco

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 21:51               ` Marco Costalba
@ 2008-01-10 22:01                 ` Sam Vilain
  2008-01-10 22:18                 ` Nicolas Pitre
  1 sibling, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2008-01-10 22:01 UTC (permalink / raw)
  To: Marco Costalba
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Junio C Hamano

Marco Costalba wrote:
> BTW, would it be possible to test git with zlib disabled even now? I mean,
> is there a quick hack to disable zlib not only when writing but also when
> reading, so that we can see what happens when running on a repository
> packed without compression?

See Nicolas Pitre's hack on another branch of this thread - it won't
cut out zlib entirely, but at least it's just configuring it to do plain
pass-through.  You can probably just replace pack_compression_level with 0.

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 21:45                 ` Sam Vilain
@ 2008-01-10 22:03                   ` Linus Torvalds
  2008-01-10 22:28                     ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2008-01-10 22:03 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano



On Fri, 11 Jan 2008, Sam Vilain wrote:
> 
> For reference, 20 years of Perl with very deep deltas:
> 
> wilber:~/src/perl-preview$ du -sk .git
> 73274   .git
> wilber:~/src/perl-preview$ git-repack -a
> Counting objects: 244360, done.
> Compressing objects: 100% (55493/55493), done.
> Writing objects: 100% (244360/244360), done.
> Total 244360 (delta 181061), reused 244360 (delta 181061)
> wilber:~/src/perl-preview$ du -sk .git/objects/pack/
> 75389   .git/objects/pack/

Hmm. I'm not sure I understand what this was supposed to show?

You reused all the old deltas, and you did "du -sk" on two different 
things before/after (and didn't do a "-a -d" to repack the old pack 
either). So does the result actually have anything to do with any 
compression algorithm?

Use "-a -d -f" to repack a whole archive.

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 21:51               ` Marco Costalba
  2008-01-10 22:01                 ` Sam Vilain
@ 2008-01-10 22:18                 ` Nicolas Pitre
  1 sibling, 0 replies; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-10 22:18 UTC (permalink / raw)
  To: Marco Costalba
  Cc: Pierre Habouzit, Sam Vilain, Git Mailing List,
	Johannes Schindelin, Junio C Hamano

On Thu, 10 Jan 2008, Marco Costalba wrote:

> On Jan 10, 2008 9:39 PM, Nicolas Pitre <nico@cam.org> wrote:
> >
> > Right.  Abstracting the zlib code and having different compression
> > algorithms tested in the Git context is the only way to do meaningful
> > comparisons.
> >
> 
> The first thing I would like to test, when the zlib abstraction is ready,
> is a NULL compressor, i.e. no compression/decompression at
> all, and see if 'git log' and friends are happy.

Easy: git config core.compression 0


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 22:03                   ` Linus Torvalds
@ 2008-01-10 22:28                     ` Sam Vilain
  2008-01-10 22:56                       ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2008-01-10 22:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Linus Torvalds wrote:
>> wilber:~/src/perl-preview$ du -sk .git
>> 73274   .git
>> wilber:~/src/perl-preview$ git-repack -a
>> Counting objects: 244360, done.
>> Compressing objects: 100% (55493/55493), done.
>> Writing objects: 100% (244360/244360), done.
>> Total 244360 (delta 181061), reused 244360 (delta 181061)
>> wilber:~/src/perl-preview$ du -sk .git/objects/pack/
>> 75389   .git/objects/pack/
> 
> Hmm. I'm not sure I understand what this was supposed to show?
> 
> You reused all the old deltas, and you did "du -sk" on two different 
> things before/after (and didn't do a "-a -d" to repack the old pack 
> either). So does the result actually have anything to do with any 
> compression algorithm?
> 
> Use "-a -d -f" to repack a whole archive.

Drat, guess that means I'll have to recompute the deltas - I was trying
to avoid that.

Ok, see you in an hour or two, hopefully sans bonehead mistakes this time :)

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 22:28                     ` Sam Vilain
@ 2008-01-10 22:56                       ` Linus Torvalds
  2008-01-11  1:01                         ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2008-01-10 22:56 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano



On Fri, 11 Jan 2008, Sam Vilain wrote:
> 
> Drat, guess that means I'll have to recompute the deltas - I was trying
> to avoid that.

Well, you could try to reuse the delta base information itself, but then 
recompute the actual delta data contents. It would require some 
source-code changes, but that may be faster (and result in a more accurate 
before/after picture) than actually recomputing the deltas.

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 22:56                       ` Linus Torvalds
@ 2008-01-11  1:01                         ` Sam Vilain
  2008-01-11  2:10                           ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2008-01-11  1:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Linus Torvalds wrote:
> 
> On Fri, 11 Jan 2008, Sam Vilain wrote:
>> Drat, guess that means I'll have to recompute the deltas - I was trying
>> to avoid that.
> 
> Well, you could try to reuse the delta base information itself, but then 
> recompute the actual delta data contents. It would require some 
> source-code changes, but that may be faster (and result in a more accurate 
> before/after picture) than actually recomputing the deltas.

Yes, it would - but my runs have finished.

Without compression of deltas:

wilber:~/src/perl-preview$ git-repack -a -d -f --window=250 --depth=100
Compressing objects: 100% (236554/236554), done.
Writing objects: 100% (244360/244360), done.
Total 244360 (delta 182343), reused 0 (delta 0)
wilber:~/src/perl-preview$ du -sk .git/objects/pack/
86781   .git/objects/pack/

With compression of deltas:

wilber:~/src/perl-preview$ time git-repack -a -d -f --window=250 --depth=100
Counting objects: 244360, done.
Compressing objects: 100% (236554/236554), done.
Writing objects: 100% (244360/244360), done.
Total 244360 (delta 182343), reused 0 (delta 0)

real    20m34.985s
user    20m1.003s
sys     0m25.558s
wilber:~/src/perl-preview$ du -sk .git/objects/pack/
72907   .git/objects/pack/

wilber:~/src/perl-preview$ git --version
git version 1.5.4.rc2.7.g079c9-dirty

Of course those compression parameters are quite insane.

And as a side note, either repack-objects got significantly better about
memory use between 1.5.3.5 and that version (the OOM killer fired,
killing first firefox and thunderbird :)), or apparently running
git-repack under a ulimit stops it from allocating too much VM.

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-11  1:01                         ` Sam Vilain
@ 2008-01-11  2:10                           ` Linus Torvalds
  2008-01-11  6:29                             ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2008-01-11  2:10 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano



On Fri, 11 Jan 2008, Sam Vilain wrote:
> 
> Without compression of deltas:
> 
> wilber:~/src/perl-preview$ du -sk .git/objects/pack/
> 86781 .git/objects/pack/
> 
> With compression of deltas:
> 
> wilber:~/src/perl-preview$ du -sk .git/objects/pack/
> 72907 .git/objects/pack/

Ok, so non-compressed deltas are 20% bigger.

That may well be a perfectly acceptable trade-off if the end result is 
then a lot faster. Has somebody done performance numbers? I may have 
missed them.. The best test is probably something like "git blame" on a 
file that takes an appreciable amount of time.

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-11  2:10                           ` Linus Torvalds
@ 2008-01-11  6:29                             ` Sam Vilain
  2008-01-11  7:05                               ` Sam Vilain
  2008-01-11 16:03                               ` Linus Torvalds
  0 siblings, 2 replies; 39+ messages in thread
From: Sam Vilain @ 2008-01-11  6:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Linus Torvalds wrote:
> 
> On Fri, 11 Jan 2008, Sam Vilain wrote:
>> Without compression of deltas:
>>
>> wilber:~/src/perl-preview$ du -sk .git/objects/pack/
>> 86781 .git/objects/pack/
>>
>> With compression of deltas:
>>
>> wilber:~/src/perl-preview$ du -sk .git/objects/pack/
>> 72907 .git/objects/pack/
> 
> Ok, so non-compressed deltas are 20% bigger.
> 
> That may well be a perfectly acceptable trade-off if the end result is 
> then a lot faster. Has somebody done performance numbers? I may have 
> missed them.. The best test is probably something like "git blame" on a 
> file that takes an appreciable amount of time.

The difference seems only barely measurable;

wilber:~/src/perl-preview$ time git annotate sv.c >/dev/null

real    0m8.130s
user    0m6.712s
sys     0m1.412s

wilber:~/src/perl-preview-loose$ time git annotate sv.c >/dev/null

real    0m7.930s
user    0m6.480s
sys     0m1.408s

(each one is last of three runs - dual-core x86_64 @ 2.1GHz w/512KB cache)

sv.c has about 1500 revisions, though the oldest line is ...  I also tried
annotate and log on the YACC-generated parser, which only has about 165
revisions, with similar results: a very minor difference or no difference.

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-11  6:29                             ` Sam Vilain
@ 2008-01-11  7:05                               ` Sam Vilain
  2008-01-11 16:03                               ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2008-01-11  7:05 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Linus Torvalds, Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Sam Vilain wrote:
> sv.c has about 1500 revisions, though the oldest line is 

only about 900 revisions old.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 21:30                 ` Nicolas Pitre
@ 2008-01-11  8:57                   ` Pierre Habouzit
  0 siblings, 0 replies; 39+ messages in thread
From: Pierre Habouzit @ 2008-01-11  8:57 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Sam Vilain, Git Mailing List, Johannes Schindelin,
	Marco Costalba, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 2891 bytes --]

On Thu, Jan 10, 2008 at 09:30:59PM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2008, Linus Torvalds wrote:
> 
> > 
> > 
> > On Thu, 10 Jan 2008, Nicolas Pitre wrote:
> > > 
> > > Here's my rather surprising results:
> > > 
> > > My kernel repo pack size without the patch:	184275401 bytes
> > > Same repo with the above patch applied:		205204930 bytes
> > > 
> > > So it is only 11% larger.  I was expecting much more.
> > 
> > It's probably worth doing those statistics on some other projects.
> > 
> > Maybe the difference to other repositories isn't huge, and maybe the 
> > kernel *is* a good test-case, but I just wouldn't take that for granted. 
> 
> Obviously.
> 
> This was a really crude test, and my initial goal was to quickly dismiss 
> Pierre's assertion.  Turns out that he wasn't that wrong after all,

  Well, that wasn't a random assertion; I made it because I assumed that
a delta is usually less than a few hundred bytes, and as compression is
applied to each delta alone, without context, you end up compressing 500
bytes at a time, which will seldom achieve excellent compression ratios.

> and 
> if a significant increase in access speed by avoiding zlib for 82% of 
> object accesses can also be demonstrated for the kernel, then we have an 
> opportunity for some optimization tradeoff with no backward 
> compatibility concerns.

  Well, one could use the fact that deltas are not compressed to avoid
copying them around, and that will _necessarily_ become a gain (you can
read them where they have been mmapped, for instance). The numbers that
were given for git annotate use a compression level of `0', which doesn't
exploit that fact, and I wouldn't be surprised to see a noticeable gain
if one does.

  And actually, maybe it's not the deltas we should leave uncompressed, but
objects under a certain size (say 512 bytes?), whatever type they
have, and then have the code exploit that fact for real and avoid copies.
With this criterion, I expect the repository not to grow a lot larger
(I'd say quite a bit less than the 10% you had, as even in the kernel there
_are_ some larger deltas, and we definitely lose space on them; I'd
expect less than a 5% size variation), and I _think_ it's worth
investigating. At the least I expect commands (like blame or even log[0])
that go through a lot of small objects to see a 10 to 20% speed
increase (backed by some experience I have in avoiding copies
in not-so-similar cases though, so it may be less, and I'll stand
corrected -- and a bit disappointed).

  [0] If I'm correct, commit messages are "objects" of their own, and I
      don't expect them to be over 512 octets very often.
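
To make the zero-copy idea concrete, here is a purely hypothetical sketch
(neither pack_window_view nor read_entry_no_copy exists in git): if an entry
is known to have been stored raw, the reader could hand back a pointer into
the mmap()ed pack instead of allocating a buffer and inflating into it.
Note that with deflateInit(..., 0) the data is still wrapped in a zlib
stream, so exploiting this for real would require the reader to know the
entry is raw, e.g. via a new pack version:

struct pack_window_view {
	const unsigned char *data;   /* mmap()ed pack contents */
	unsigned long len;           /* length of the mapped region */
};

/*
 * Returns a borrowed pointer into the map for entries stored raw,
 * or NULL to tell the caller to fall back to the inflate() path.
 */
static const unsigned char *read_entry_no_copy(const struct pack_window_view *win,
                                               unsigned long offset,
                                               unsigned long size,
                                               int stored_uncompressed)
{
	if (!stored_uncompressed)
		return NULL;               /* compressed: inflate as usual */
	if (offset > win->len || size > win->len - offset)
		return NULL;               /* truncated or corrupt pack */
	return win->data + offset;         /* zero-copy: data stays in the map */
}
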
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 20:39             ` Nicolas Pitre
  2008-01-10 21:01               ` Linus Torvalds
  2008-01-10 21:51               ` Marco Costalba
@ 2008-01-11  9:45               ` Pierre Habouzit
  2008-01-11 14:27                 ` Nicolas Pitre
  2008-01-11 14:18               ` Morten Welinder
  3 siblings, 1 reply; 39+ messages in thread
From: Pierre Habouzit @ 2008-01-11  9:45 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Git Mailing List, Johannes Schindelin, Marco Costalba,
	Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 5671 bytes --]

On Thu, Jan 10, 2008 at 08:39:07PM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2008, Pierre Habouzit wrote:

> diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> index a39cb82..252b03e 100644
> --- a/builtin-pack-objects.c
> +++ b/builtin-pack-objects.c
> @@ -433,7 +433,10 @@ static unsigned long write_object(struct sha1file *f,
>  		}
>  		/* compress the data to store and put compressed length in datalen */
>  		memset(&stream, 0, sizeof(stream));
> -		deflateInit(&stream, pack_compression_level);
> +		if (obj_type == OBJ_REF_DELTA || obj_type == OBJ_OFS_DELTA)
> +			deflateInit(&stream, 0);
> +		else
> +			deflateInit(&stream, pack_compression_level);
>  		maxsize = deflateBound(&stream, size);
>  		out = xmalloc(maxsize);
>  		/* Compress it */
> 
> You then only need to run 'git repack -a -f -d' with and without the 
> above patch.

  Using, as a PoC, a test of `if (size <= 512)' instead, I get:

vanilla git:

$ du -k .git/**/*.pack
180808 .git/objects/pack/pack-7bc9f383c92cbffe366da2d2a62b67bb33a53365.pack
$ repeat 5 time git blame MAINTAINERS >|/dev/null
git blame MAINTAINERS >| /dev/null  7,34s user 0,09s system 99% cpu 7,433 total
git blame MAINTAINERS >| /dev/null  7,31s user 0,16s system 100% cpu 7,475 total
git blame MAINTAINERS >| /dev/null  7,35s user 0,08s system 100% cpu 7,431 total
git blame MAINTAINERS >| /dev/null  7,30s user 0,18s system 99% cpu 7,482 total
git blame MAINTAINERS >| /dev/null  7,33s user 0,16s system 99% cpu 7,492 total


With a compression disabled for sizes <= 512:

$ du -k .git/**/*.pack
188840 .git/objects/pack/pack-7bc9f383c92cbffe366da2d2a62b67bb33a53365.pack
$ repeat 5 time git blame MAINTAINERS >|/dev/null
git blame MAINTAINERS >| /dev/null  7,06s user 0,09s system 100% cpu 7,150 total
git blame MAINTAINERS >| /dev/null  7,08s user 0,13s system 99% cpu 7,209 total
git blame MAINTAINERS >| /dev/null  7,07s user 0,08s system 99% cpu 7,168 total
git blame MAINTAINERS >| /dev/null  7,02s user 0,15s system 99% cpu 7,177 total
git blame MAINTAINERS >| /dev/null  7,07s user 0,13s system 99% cpu 7,243 total

Okay, the size barely budges, but the speedup is nowhere near being fun
either: we gain 3% of wall clock time.


Let's try with a limit of 1024 then!

$ du -k .git/**/*.pack
201725	.git/objects/pack/pack-7bc9f383c92cbffe366da2d2a62b67bb33a53365.pack
$ repeat 5 time git blame MAINTAINERS >|/dev/null
git blame MAINTAINERS >| /dev/null  6,93s user 0,16s system 77% cpu 9,109 total
git blame MAINTAINERS >| /dev/null  6,88s user 0,08s system 99% cpu 6,965 total
git blame MAINTAINERS >| /dev/null  6,84s user 0,10s system 99% cpu 6,952 total
git blame MAINTAINERS >| /dev/null  6,86s user 0,12s system 99% cpu 6,983 total
git blame MAINTAINERS >| /dev/null  6,81s user 0,18s system 99% cpu 6,994 total


Okay, the pack grows 10%, and the blame takes 6% less time.


Okay, the numbers are still not that impressive, but my patch doesn't
touch _only_ deltas but also commit messages, as I said, so I've redone my
tests with git log and *TADAAAA*:

vanilla git:
    repeat 5 time git log >|/dev/null
    git log >| /dev/null  2,54s user 0,12s system 99% cpu 2,660 total
    git log >| /dev/null  2,52s user 0,12s system 99% cpu 2,653 total
    git log >| /dev/null  2,57s user 0,07s system 99% cpu 2,637 total
    git log >| /dev/null  2,56s user 0,09s system 99% cpu 2,659 total
    git log >| /dev/null  2,54s user 0,10s system 99% cpu 2,660 total

with the 512 octets limit:

    $ repeat 5 time git log >|/dev/null
    git log >| /dev/null  2,10s user 0,10s system 99% cpu 2,193 total
    git log >| /dev/null  2,08s user 0,10s system 99% cpu 2,189 total
    git log >| /dev/null  2,06s user 0,11s system 100% cpu 2,162 total
    git log >| /dev/null  2,04s user 0,13s system 100% cpu 2,172 total
    git log >| /dev/null  2,06s user 0,13s system 99% cpu 2,198 total

    That's already a 20% time reduction.


with the 1024 octets limits:
    $ repeat 5 time git log >|/dev/null
    git log >| /dev/null  1,39s user 0,12s system 99% cpu 1,512 total
    git log >| /dev/null  1,38s user 0,12s system 100% cpu 1,498 total
    git log >| /dev/null  1,41s user 0,10s system 99% cpu 1,514 total
    git log >| /dev/null  1,41s user 0,10s system 100% cpu 1,506 total
    git log >| /dev/null  1,40s user 0,10s system 100% cpu 1,504 total

    Yes, that's a 43% time reduction!

  As a side note, repacking with the 1024-octet limit takes 4:06 here,
and 4:26 without any limit at all, which is 8% less time. I know it
doesn't matter a lot as repack is a one-time operation, but still, it
would speed up git gc --auto, which is not something to neglect
completely.


I say it's worth investigating a _lot_, and the patch is no more complicated than this:

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index a39cb82..f454929 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -433,7 +433,7 @@ static unsigned long write_object(struct sha1file *f,
                }
                /* compress the data to store and put compressed length in datalen */
                memset(&stream, 0, sizeof(stream));
-               deflateInit(&stream, pack_compression_level);
+               deflateInit(&stream, size > 1024 ? pack_compression_level : 0);
                maxsize = deflateBound(&stream, size);
                out = xmalloc(maxsize);
                /* Compress it */


-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-10 20:39             ` Nicolas Pitre
                                 ` (2 preceding siblings ...)
  2008-01-11  9:45               ` Pierre Habouzit
@ 2008-01-11 14:18               ` Morten Welinder
  3 siblings, 0 replies; 39+ messages in thread
From: Morten Welinder @ 2008-01-11 14:18 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Pierre Habouzit, Sam Vilain, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

> This is really the big point here.  Git uses _lots_ of *small* objects,
> usually much smaller than 12KB.  For example, my copy of the gcc
> repository has an average of 270 _bytes_ per compressed object, and
> objects must be individually compressed.
>
> Performance with really small objects should be the basis for any
> Git compression algorithm comparison.

If it so happens that one algorithm does much better on small objects
while another does better on large objects, there really is nothing that
prevents using both in a repository.  It's a bit of code bloat, of course.

Morten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-11  9:45               ` Pierre Habouzit
@ 2008-01-11 14:27                 ` Nicolas Pitre
  0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-11 14:27 UTC (permalink / raw)
  To: Pierre Habouzit
  Cc: Sam Vilain, Git Mailing List, Johannes Schindelin, Marco Costalba,
	Junio C Hamano

On Fri, 11 Jan 2008, Pierre Habouzit wrote:

> Okay the numbers are still not that impressive, but my patch doesn't
> touch _only_ deltas, but also log comments, as I said, so I've redone my
> tests with git log and *TADAAAA*:
> 
> vanilla git:
>     repeat 5 time git log >|/dev/null
>     git log >| /dev/null  2,54s user 0,12s system 99% cpu 2,660 total
>     git log >| /dev/null  2,52s user 0,12s system 99% cpu 2,653 total
>     git log >| /dev/null  2,57s user 0,07s system 99% cpu 2,637 total
>     git log >| /dev/null  2,56s user 0,09s system 99% cpu 2,659 total
>     git log >| /dev/null  2,54s user 0,10s system 99% cpu 2,660 total
> 
> with the 512 octets limit:
> 
>     $ repeat 5 time git log >|/dev/null
>     git log >| /dev/null  2,10s user 0,10s system 99% cpu 2,193 total
>     git log >| /dev/null  2,08s user 0,10s system 99% cpu 2,189 total
>     git log >| /dev/null  2,06s user 0,11s system 100% cpu 2,162 total
>     git log >| /dev/null  2,04s user 0,13s system 100% cpu 2,172 total
>     git log >| /dev/null  2,06s user 0,13s system 99% cpu 2,198 total
> 
>     That's already a 20% time reduction.

Well, sorry, but that doesn't count for much to me.  The whole 'git log' 
taking around 2 seconds is already hellishly fast for what it does, and 
IMHO this is not worth increasing the repository storage size for this 
particular workload.

> with the 1024-octet limit:
>     $ repeat 5 time git log >|/dev/null
>     git log >| /dev/null  1,39s user 0,12s system 99% cpu 1,512 total
>     git log >| /dev/null  1,38s user 0,12s system 100% cpu 1,498 total
>     git log >| /dev/null  1,41s user 0,10s system 99% cpu 1,514 total
>     git log >| /dev/null  1,41s user 0,10s system 100% cpu 1,506 total
>     git log >| /dev/null  1,40s user 0,10s system 100% cpu 1,504 total
> 
>     Yes, that's a 43% time reduction!

If that were a 43% reduction of a 10-second operation, like the blame 
operation typically is, then sure, I would agree.  But otherwise the 
significant storage size increase is not worth a reduction of less 
than a second in absolute time.

>   As a side note, repacking with the 1024-octet limit takes 4:06 here,
> versus 4:26 without any limit at all, which is about 8% less time. I know
> it doesn't matter much since repack is a one-time operation, but still, it
> would speed up git gc --auto, which is not something to neglect
> completely.

No, I doubt it would.  The bulk of 'git gc --auto' will reuse existing 
pack data which is way different from 'git repack -f'. 

> I say it's worth investigating a _lot_,

Well, I was initially enthusiastic about this avenue, but the speed 
difference is far from impressive IMHO, given the tradeoff.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-11  6:29                             ` Sam Vilain
  2008-01-11  7:05                               ` Sam Vilain
@ 2008-01-11 16:03                               ` Linus Torvalds
  2008-01-12  1:52                                 ` Sam Vilain
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2008-01-11 16:03 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano



On Fri, 11 Jan 2008, Sam Vilain wrote:
> 
> The difference seems only barely measurable;

Ok. 

It may be that it might help other cases, but that seems unlikely.

The more likely answer is that it's either of:

 - yes, zlib uncompression is noticeable in profiles, but that the 
   cold-cache access is simply the bigger problem, and getting rid of zlib 
   just moves the expense to whatever other thing that needs to access it 
   (memcpy, xdelta apply, whatever)

or

 - I don't know exactly which patch you used (did you just do the 
   "core.deltacompression=0" thing?), and maybe zlib is fairly expensive 
   even for just the setup crud, even when it doesn't really need to be 
   (see the sketch below).

but who knows..
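
A quick way to look at just the setup cost, completely outside of git (a
standalone sketch; the stream is never fed any data and the iteration
count is arbitrary):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <zlib.h>

/* Time zlib stream setup alone: init/end a deflate stream N times
 * without compressing anything, then compare with resetting a single
 * stream that was set up once. */
int main(void)
{
	enum { N = 200000 };
	z_stream s;
	clock_t t;
	int i;

	t = clock();
	for (i = 0; i < N; i++) {
		memset(&s, 0, sizeof(s));
		deflateInit(&s, Z_DEFAULT_COMPRESSION);
		deflateEnd(&s);
	}
	printf("init/end every time: %.2fs\n",
	       (double)(clock() - t) / CLOCKS_PER_SEC);

	memset(&s, 0, sizeof(s));
	deflateInit(&s, Z_DEFAULT_COMPRESSION);
	t = clock();
	for (i = 0; i < N; i++)
		deflateReset(&s);
	printf("reset and reuse:     %.2fs\n",
	       (double)(clock() - t) / CLOCKS_PER_SEC);
	deflateEnd(&s);
	return 0;
}

If the init/end loop dominates, that would argue for reusing one stream
(deflateReset/inflateReset) rather than giving up on compression.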

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-11 16:03                               ` Linus Torvalds
@ 2008-01-12  1:52                                 ` Sam Vilain
  2008-01-12  2:32                                   ` Nicolas Pitre
  2008-01-12  4:46                                   ` Junio C Hamano
  0 siblings, 2 replies; 39+ messages in thread
From: Sam Vilain @ 2008-01-12  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Linus Torvalds wrote:
> 
> On Fri, 11 Jan 2008, Sam Vilain wrote:
>> The difference seems only barely measurable;
> 
> Ok. 
> 
> It may be that it might help other cases, but that seems unlikely.
> 
> The more likely answer is that it's either of:
> 
>  - yes, zlib uncompression is noticeable in profiles, but that the 
>    cold-cache access is simply the bigger problem, and getting rid of zlib 
>    just moves the expense to whatever other thing that needs to access it 
>    (memcpy, xdelta apply, whatever)
> 
> or
> 
>  - I don't know exactly which patch you used (did you just do the 
>    "core.deltacompression=0" thing?), and maybe zlib is fairly expensive 
>    even for just the setup crud, even when it doesn't really need to be.
> 
> but who knows..

Well, my figures agree with Pierre I think - 6-10% time savings for
'git annotate'.

I think Pierre has hit the nail on the head - that skipping
compression for small objects is a clear win.  He saw the obvious
criterion, really.  I've knocked it up as a config option that doesn't
change the default behaviour below.

I can't help but speculate about the benefits of having one or two of
the most elite compression algorithms (eg, lzop, or even lzma for the
larger blobs) available, in general.  Eg, if gzip needs a stream longer
than X kb to offer substantial benefits over lzop, use lzop for the
ones shorter than that.

If the uncompressed objects are clustered in the pack, then they might
stream-compress a lot better, should they be transmitted over an http
transport with gzip encoding.  For packs which should be as small as
possible, a format change would even let them be distributed as one
compressed resource.  The ordering of the objects would ideally be
selected so that it results in optimum compression - which could add
a saving akin to bzip2 vs gzip, at the expense of having to scan the
small objects for mini-deltas and cluster the objects which share
these mini-deltas.
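
As a rough illustration of that effect, completely outside of git (a
throwaway sketch; the data is made up, but in the ~270-bytes-per-object
ballpark mentioned earlier in the thread):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Deflate many small, similar buffers one by one, then deflate their
 * concatenation as a single stream, and compare the totals. */
int main(void)
{
	enum { N = 1000, SZ = 300 };
	static unsigned char obj[N][SZ], all[N * SZ];
	static unsigned char out[2 * N * SZ];	/* generous scratch space */
	unsigned long individual = 0;
	uLongf outlen;
	int i;

	for (i = 0; i < N; i++) {
		snprintf((char *)obj[i], SZ,
			 "tree %040d\nparent %040d\n"
			 "author A U Thor <a@example.com>\n\n"
			 "commit message number %d\n", i, i - 1, i);
		memcpy(all + (size_t)i * SZ, obj[i], SZ);
	}

	for (i = 0; i < N; i++) {
		outlen = sizeof(out);
		compress2(out, &outlen, obj[i], SZ, Z_DEFAULT_COMPRESSION);
		individual += outlen;
	}

	outlen = sizeof(out);
	compress2(out, &outlen, all, sizeof(all), Z_DEFAULT_COMPRESSION);

	printf("one by one: %lu bytes, as one stream: %lu bytes\n",
	       individual, (unsigned long)outlen);
	return 0;
}

On data like that, the single stream typically ends up much smaller than
the sum of the individually deflated pieces, which is the kind of win
described above - at the price of giving up the per-object random access
the pack format relies on.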

Well, interesting ideas anyway :)
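
To try out the option below, one would set it and then force a full
repack, e.g. 'git config pack.compressionMinSize 1024' followed by
'git repack -a -d -f', since it only takes effect for objects as they
are written into a new pack; which threshold is actually worth it is
exactly what needs benchmarking.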

Subject: [PATCH] pack-objects: add compressionMinSize option

Objects smaller than a page don't save much space when compressed, and
cause some overhead.  Allow the user to specify a minimum size for
objects before they are compressed.

Credit: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Sam Vilain <sam.vilain@catalyst.net.nz>
---
 Documentation/config.txt |    5 +++++
 builtin-pack-objects.c   |    7 ++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1b6d6d6..245121e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -734,6 +734,11 @@ pack.compression::
 	compromise between speed and compression (currently equivalent
 	to level 6)."
 
+pack.compressionMinSize::
+	Objects smaller than this are not compressed.  This can make
+	operations that deal with many small objects (such as log)
+	faster.
+
 pack.deltaCacheSize::
 	The maximum memory in bytes used for caching deltas in
 	linkgit:git-pack-objects[1].
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index a39cb82..316b809 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -76,6 +76,7 @@ static int num_preferred_base;
 static struct progress *progress_state;
 static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 static int pack_compression_seen;
+static int compression_min_size = 0;
 
 static unsigned long delta_cache_size = 0;
 static unsigned long max_delta_cache_size = 0;
@@ -433,7 +434,7 @@ static unsigned long write_object(struct sha1file *f,
 		}
 		/* compress the data to store and put compressed length in datalen */
 		memset(&stream, 0, sizeof(stream));
-		deflateInit(&stream, pack_compression_level);
+		deflateInit(&stream, size >= compression_min_size ? pack_compression_level : 0);
 		maxsize = deflateBound(&stream, size);
 		out = xmalloc(maxsize);
 		/* Compress it */
@@ -1841,6 +1842,10 @@ static int git_pack_config(const char *k, const char *v)
 		pack_compression_seen = 1;
 		return 0;
 	}
+	if (!strcmp(k, "pack.compressionminsize")) {
+		compression_min_size = git_config_int(k, v);
+		return 0;	
+	}
 	if (!strcmp(k, "pack.deltacachesize")) {
 		max_delta_cache_size = git_config_int(k, v);
 		return 0;
-- 
1.5.3.7.2095.gb2448-dirty

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-12  1:52                                 ` Sam Vilain
@ 2008-01-12  2:32                                   ` Nicolas Pitre
  2008-01-12  3:06                                     ` Sam Vilain
  2008-01-12  4:46                                   ` Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-12  2:32 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Linus Torvalds, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

On Sat, 12 Jan 2008, Sam Vilain wrote:

> Linus Torvalds wrote:
> > 
> > On Fri, 11 Jan 2008, Sam Vilain wrote:
> >> The difference seems only barely measurable;
> > 
> > Ok. 
> > 
> > It may be that it might help other cases, but that seems unlikely.
> > 
> > The more likely answer is that it's either of:
> > 
> >  - yes, zlib uncompression is noticeable in profiles, but that the 
> >    cold-cache access is simply the bigger problem, and getting rid of zlib 
> >    just moves the expense to whatever other thing that needs to access it 
> >    (memcpy, xdelta apply, whatever)
> > 
> > or
> > 
> >  - I don't know exactly which patch you used (did you just do the 
> >    "core.deltacompression=0" thing?), and maybe zlib is fairly expensive 
> >    even for just the setup crud, even when it doesn't really need to be.
> > 
> > but who knows..
> 
> Well, my figures agree with Pierre I think - 6-10% time savings for
> 'git annotate'.
> 
> I think Pierre has hit the nail on the head - that skipping
> compression for small objects is a clear win.  He saw the obvious
> criterion, really.  I've knocked it up as a config option that doesn't
> change the default behaviour below.

Sorry to rain on your parade, but to me 6-10% time saving is not a clear 
win at all, given the equal increase in repository size.  This is simply 
not worth it.

And a 50% time saving on an operation such as git log, which takes less 
than 2 seconds in absolute time, is not worth the repo size increase 
either.  Going from 2 seconds down to one second doesn't make enough of 
a user experience difference.

If git blame was to go from 10 seconds down to 4 then I'd say this is a 
clear win.  But this is not the case.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-12  2:32                                   ` Nicolas Pitre
@ 2008-01-12  3:06                                     ` Sam Vilain
  2008-01-12 16:09                                       ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2008-01-12  3:06 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

Nicolas Pitre wrote:
> Sorry to rain on your parade, but to me 6-10% time saving is not a clear 
> win at all, given the equal increase in repository size.  This is simply 
> not worth it.

Agree.

> And a 50% time saving on an operation such as git log, which takes less 
> than 2 seconds in absolute time, is not worth the repo size increase 
> either.

Disagree.  Going as much as twice as fast for many history operations
for 10% added space sounds like a clear win to me.  We can easily agree
to disagree though - making it a disabled by default config option
allows the user to unroll their packs if they want.

> Going from 2 seconds down to one second doesn't make enough of 
> a user experience difference.

What do you mean?  1 second waiting is far better than 2 seconds
waiting.  And the mmap optimizations have not even begun yet - that
could result in boosts from zero-copy, such as a lighter VM footprint.

> If git blame was to go from 10 seconds down to 4 then I'd say this is a 
> clear win.  But this is not the case.

This is an awesome boost!  Everything feels snappier already :)

maia:~/src/perl.clean$ time git-log | LANG=C wc
 288927  894027 8860916

real    0m0.839s
user    0m0.824s
sys     0m0.144s
maia:~/src/perl.clean$ cd ../perl.clean.loose/
maia:~/src/perl.clean.loose$ time git-log | LANG=C wc
 288927  894027 8860916

real    0m0.515s
user    0m0.504s
sys     0m0.136s

maia:~/src/perl.clean.loose$ du -sk .git/objects/pack/
113484  .git/objects/pack/
maia:~/src/perl.clean.loose$ cd -
/home/samv/src/perl.clean
maia:~/src/perl.clean$ du -sk .git/objects/pack/
107040  .git/objects/pack/
maia:~/src/perl.clean$
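
That is about 6% more pack space (113484k vs 107040k) in exchange for
roughly 40% less 'git log' wall-clock time (0.515s vs 0.839s).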

Want me to try this on kde.git?

Sam.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-12  1:52                                 ` Sam Vilain
  2008-01-12  2:32                                   ` Nicolas Pitre
@ 2008-01-12  4:46                                   ` Junio C Hamano
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2008-01-12  4:46 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Linus Torvalds, Nicolas Pitre, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba

Sam Vilain <sam@vilain.net> writes:

> If the uncompressed objects are clustered in the pack, then they might
> stream-compress a lot better, should they be transmitted over an http
> transport with gzip encoding.

That would only have been a sensible optimization in the older
native pack protocol, where we always exploded the transferred
packfile.  However, these days we tend to keep the packfile and
re-index it at the receiving end (the http transport never exploded
the packfile and it still doesn't).  When used that way, choosing
the object layout in the packfile in such a way as to ignore recency
order and cluster objects by their delta chain, which you are
advocating to reduce the transfer overhead, is a bad tradeoff.
Your packs will be kept in the form you chose for transport,
which is a layout that hurts the runtime performance, and you
keep using those suboptimal packs any number of times, getting
hurt every time.

> @@ -433,7 +434,7 @@ static unsigned long write_object(struct sha1file *f,
>  		}
>  		/* compress the data to store and put compressed length in datalen */
>  		memset(&stream, 0, sizeof(stream));
> -		deflateInit(&stream, pack_compression_level);
> +		deflateInit(&stream, size >= compression_min_size ? pack_compression_level : 0);
>  		maxsize = deflateBound(&stream, size);
>  		out = xmalloc(maxsize);
>  		/* Compress it */

I very much like the simplicity of the patch.  If such a simple
approach can give us a clear performance gain, I am all for it.

Benchmarks on different repositories need to back that up,
though.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-12  3:06                                     ` Sam Vilain
@ 2008-01-12 16:09                                       ` Nicolas Pitre
  2008-01-12 16:44                                         ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2008-01-12 16:09 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Linus Torvalds, Pierre Habouzit, Git Mailing List,
	Johannes Schindelin, Marco Costalba, Junio C Hamano

On Sat, 12 Jan 2008, Sam Vilain wrote:

> Nicolas Pitre wrote:
> > Sorry to rain on your parade, but to me 6-10% time saving is not a clear 
> > win at all, given the equal increase in repository size.  This is simply 
> > not worth it.
> 
> Agree.
> 
> > And a 50% time saving on an operation such as git log, which takes less 
> > than 2 seconds in absolute time, is not worth the repo size increase 
> > either.
> 
> Disagree.  Going as much as twice as fast for many history operations
> for 10% added space sounds like a clear win to me.

If you can come up with a real-life scenario, and not simply a test with 
little relevance to typical usage, that shows a clear, humanly perceptible 
reduction in execution time, then I'll agree with you.  But a full history 
log taking one second instead of two isn't a good enough argument to me 
for making the repository many megabytes larger.  Again, if it were 
'git blame' taking 5 seconds instead of 10, then I would agree that this 
is a clear win, even though that is also a 50% execution time reduction.  
But human perception matters far more when it is 10 secs down to 5 than 
when it is 2 secs down to 1.

This proposed change isn't free, because you have to introduce a 
regression in one place in order to make a gain somewhere else.  The pack 
v4 format that I developed with Shawn, though, was showing _both_ a 
speed gain and a repository size reduction, so the improvements come 
with no regression attached.  *That* is a clear win.

> We can easily agree
> to disagree though

I suppose we do.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Decompression speed: zip vs lzo
  2008-01-12 16:09                                       ` Nicolas Pitre
@ 2008-01-12 16:44                                         ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2008-01-12 16:44 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Linus Torvalds, Pierre Habouzit, Git Mailing List,
	Marco Costalba, Junio C Hamano

Hi,

On Sat, 12 Jan 2008, Nicolas Pitre wrote:

> On Sat, 12 Jan 2008, Sam Vilain wrote:
> 
> > Going as much as twice as fast for many history operations for 10% 
> > added space sounds like a clear win to me.

I have to agree with Nicolas.  A full history log is such a rare occasion 
that it is not worth optimising for.

When I call "git log", it typically shows me the first commit 
_instantaneously_, which is plenty fast enough for me, especially given 
that I quit it right away or after a few pages more often than not.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2008-01-12 16:45 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-09 22:01 Decompression speed: zip vs lzo Marco Costalba
2008-01-09 22:55 ` Junio C Hamano
2008-01-09 23:23   ` Sam Vilain
2008-01-09 23:31     ` Johannes Schindelin
2008-01-10  1:02       ` Sam Vilain
2008-01-10  5:02         ` Sam Vilain
2008-01-10  9:16           ` Pierre Habouzit
2008-01-10 20:39             ` Nicolas Pitre
2008-01-10 21:01               ` Linus Torvalds
2008-01-10 21:30                 ` Nicolas Pitre
2008-01-11  8:57                   ` Pierre Habouzit
2008-01-10 21:45                 ` Sam Vilain
2008-01-10 22:03                   ` Linus Torvalds
2008-01-10 22:28                     ` Sam Vilain
2008-01-10 22:56                       ` Linus Torvalds
2008-01-11  1:01                         ` Sam Vilain
2008-01-11  2:10                           ` Linus Torvalds
2008-01-11  6:29                             ` Sam Vilain
2008-01-11  7:05                               ` Sam Vilain
2008-01-11 16:03                               ` Linus Torvalds
2008-01-12  1:52                                 ` Sam Vilain
2008-01-12  2:32                                   ` Nicolas Pitre
2008-01-12  3:06                                     ` Sam Vilain
2008-01-12 16:09                                       ` Nicolas Pitre
2008-01-12 16:44                                         ` Johannes Schindelin
2008-01-12  4:46                                   ` Junio C Hamano
2008-01-10 21:51               ` Marco Costalba
2008-01-10 22:01                 ` Sam Vilain
2008-01-10 22:18                 ` Nicolas Pitre
2008-01-11  9:45               ` Pierre Habouzit
2008-01-11 14:27                 ` Nicolas Pitre
2008-01-11 14:18               ` Morten Welinder
2008-01-10  3:41       ` Nicolas Pitre
2008-01-10  6:55         ` Marco Costalba
2008-01-10 11:45           ` Marco Costalba
2008-01-10 12:12             ` Johannes Schindelin
2008-01-10 12:18               ` Marco Costalba
2008-01-10 19:34           ` Dana How
2008-01-09 23:49     ` Junio C Hamano
