git.vger.kernel.org archive mirror
* Compression speed for large files
@ 2006-07-03 11:13 Joachim B Haga
  2006-07-03 12:03 ` Alex Riesen
  2006-07-03 21:45 ` Compression speed for large files Jeff King
  0 siblings, 2 replies; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 11:13 UTC (permalink / raw)
  To: git

I'm looking at doing version control of data files, potentially very large,
often binary. In git, committing large files is very slow; I have tested with
a 45MB file, which takes about 1 minute to check in (on an Intel Core Duo 2GHz).

Now, most of the time is spent in compressing the file. Would it be a good idea
to change the Z_BEST_COMPRESSION flag to zlib, at least for large files? I have
measured the time spent by git-commit with different flags in sha1_file.c:

  method                 time (s)  object size (kB)
  Z_BEST_COMPRESSION     62.0      17136
  Z_DEFAULT_COMPRESSION  10.4      16536
  Z_BEST_SPEED            4.8      17071

In this case Z_BEST_COMPRESSION also compresses worse, but that's not the major
issue: the time is. Here's a couple of other data points, measured with gzip -9,
-6 and -1 (comparable to the Z_ flags above):

129MB ascii data file
  method    time (s)  object size (kB)
  gzip -9   158       23066
  gzip -6    18       23619
  gzip -1     6       32304

3MB ascii data file
  gzip -9   2.2        887
  gzip -6   0.7        912
  gzip -1   0.3       1134

So: is it a good idea to change to faster compression, at least for larger
files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION only
for small files (perhaps <1MB?) and Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
larger ones.
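The level constants in the table are zlib's own, so the tradeoff is easy to reproduce outside git; a minimal sketch using Python's stdlib zlib binding (the payload below is synthetic, standing in for the 45MB data file):

```python
import time
import zlib

# Synthetic, fairly compressible payload standing in for a large data file
# (assumption: the original measurement used a 45MB generated file).
data = b"".join(b"%08d some repetitive record text\n" % i for i in range(50000))

for name, level in [("Z_BEST_SPEED", zlib.Z_BEST_SPEED),
                    ("Z_DEFAULT_COMPRESSION", zlib.Z_DEFAULT_COMPRESSION),
                    ("Z_BEST_COMPRESSION", zlib.Z_BEST_COMPRESSION)]:
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print("%-22s %7.3fs  %8d bytes" % (name, elapsed, len(out)))
```

As the table shows, the highest level is not guaranteed to produce the smallest output for every input; only the extra time cost is predictable.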


-j.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Compression speed for large files
  2006-07-03 11:13 Compression speed for large files Joachim B Haga
@ 2006-07-03 12:03 ` Alex Riesen
  2006-07-03 12:42   ` Elrond
  2006-07-03 13:32   ` Joachim Berdal Haga
  2006-07-03 21:45 ` Compression speed for large files Jeff King
  1 sibling, 2 replies; 21+ messages in thread
From: Alex Riesen @ 2006-07-03 12:03 UTC (permalink / raw)
  To: Joachim B Haga; +Cc: git

On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
> So: is it a good idea to change to faster compression, at least for larger
> files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION only
> for small files (perhaps <1MB?) and Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
> larger ones.

Probably yes, as a per-repo config option.


* Re: Compression speed for large files
  2006-07-03 12:03 ` Alex Riesen
@ 2006-07-03 12:42   ` Elrond
  2006-07-03 13:44     ` Joachim B Haga
  2006-07-03 13:32   ` Joachim Berdal Haga
  1 sibling, 1 reply; 21+ messages in thread
From: Elrond @ 2006-07-03 12:42 UTC (permalink / raw)
  To: git

Joachim B Haga <cjhaga <at> fys.uio.no> writes:
[...]
>   method                 time (s)  object size (kB)
>   Z_BEST_COMPRESSION     62.0      17136
>   Z_DEFAULT_COMPRESSION  10.4      16536
>   Z_BEST_SPEED            4.8      17071
> 
> In this case Z_BEST_COMPRESSION also compresses worse,
[...]

I personally find that very interesting; is this a known "issue" with zlib?
It suggests that, with different options, it's possible to create smaller
repositories, despite the 'advertised' (by zlib, not git) "best" compression.


Alex Riesen <raa.lkml <at> gmail.com> writes:
[...]
> Probably yes, as a per-repo config option.

The option probably should be the size for which to start using
"default" compression.
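The size threshold Elrond suggests could look something like this (a sketch only; the 1MB cutoff is the value floated earlier in the thread, not anything git implements):

```python
import zlib

ONE_MB = 1 << 20  # assumption: the "<1MB" threshold suggested earlier

def compression_level(size):
    """Pick a zlib level from the object size (hypothetical policy)."""
    if size < ONE_MB:
        return zlib.Z_BEST_COMPRESSION    # 9: small files, size matters more
    return zlib.Z_DEFAULT_COMPRESSION     # -1: large files, time matters more
```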


    Elrond


* Re: Compression speed for large files
  2006-07-03 12:03 ` Alex Riesen
  2006-07-03 12:42   ` Elrond
@ 2006-07-03 13:32   ` Joachim Berdal Haga
       [not found]     ` <Pine.LNX.4.64.0607031030150.1213@localhost.localdomain>
  2006-07-03 14:33     ` Nicolas Pitre
  1 sibling, 2 replies; 21+ messages in thread
From: Joachim Berdal Haga @ 2006-07-03 13:32 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git

Alex Riesen wrote:
> On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
>> So: is it a good idea to change to faster compression, at least for 
>> larger files? From my (limited) testing I would suggest using 
>> Z_BEST_COMPRESSION only for small files (perhaps <1MB?) and 
>> Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
>> larger ones.
> 
> Probably yes, as a per-repo config option.

I can send a patch later. If it's to be a per-repo option, it's probably 
too confusing with several values. Is it ok with

core.compression = [-1..9]

where the numbers are the zlib/gzip constants,
   -1 = zlib default (currently 6)
    0 = no compression
1..9 = various speed/size tradeoffs (9 is git default)
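The proposed mapping can be sanity-checked against zlib directly; a small sketch using Python's stdlib zlib binding (illustrative only, not part of the patch; level 0 stores the data uncompressed, so its output is slightly larger than the input):

```python
import zlib

data = b"hello world\n" * 1000

stored  = zlib.compress(data, 0)                           # no compression
default = zlib.compress(data, zlib.Z_DEFAULT_COMPRESSION)  # -1
level6  = zlib.compress(data, 6)

assert len(stored) > len(data)   # stored blocks add framing overhead
assert default == level6         # -1 maps to zlib's internal default, 6
```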

Btw; I just tested the kernel sources. With gzip only, but files 
compressed individually:
   time find . -type f | xargs gzip -9 -c | wc -c

I found the space saving from -6 to -9 to be under 0.6%, at double the 
CPU time. So perhaps Z_DEFAULT_COMPRESSION would be good as default.

-j


* Re: Compression speed for large files
  2006-07-03 12:42   ` Elrond
@ 2006-07-03 13:44     ` Joachim B Haga
  0 siblings, 0 replies; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 13:44 UTC (permalink / raw)
  To: git

Elrond <elrond+kernel.org <at> samba-tng.org> writes:

> 
> Joachim B Haga <cjhaga <at> fys.uio.no> writes:
> [...]
> > In this case Z_BEST_COMPRESSION also compresses worse,
> [...]
> 
> I personally find that very interesting, is this a known "issue" with zlib?
> It suggests, that with different options, it's possible to create smaller
> repositories, despite the 'advertised' (by zlib, not git) "best" compression.

There are also other tunables in zlib, such as the balance between Huffman
coding (good for data files) and string matching (good for text files). So with
more knowledge of the data it should be possible to compress even better. I'm
not advocating tuning this in git though ;)

> 
> Alex Riesen <raa.lkml <at> gmail.com> writes:
> [...]
> > Probably yes, as a per-repo config option.
> 
> The option probably should be the size for which to start using
> "default" compression.

That is possible, too. I'm open to any decision or consensus, as long as I get
my commits in less than 10s :)

-j.


* Re: Compression speed for large files
  2006-07-03 13:32   ` Joachim Berdal Haga
       [not found]     ` <Pine.LNX.4.64.0607031030150.1213@localhost.localdomain>
@ 2006-07-03 14:33     ` Nicolas Pitre
  2006-07-03 14:54       ` Yakov Lerner
  2006-07-03 16:31       ` Linus Torvalds
  1 sibling, 2 replies; 21+ messages in thread
From: Nicolas Pitre @ 2006-07-03 14:33 UTC (permalink / raw)
  To: Joachim Berdal Haga; +Cc: Alex Riesen, git

On Mon, 3 Jul 2006, Joachim Berdal Haga wrote:

> Alex Riesen wrote:
> > On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
> > > So: is it a good idea to change to faster compression, at least for larger
> > > files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION
> > > only for small files (perhaps <1MB?) and
> > > Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
> > > larger ones.
> > 
> > Probably yes, as a per-repo config option.
> 
> I can send a patch later. If it's to be a per-repo option, it's probably too
> confusing with several values. Is it ok with
> 
> core.compression = [-1..9]
> 
> where the numbers are the zlib/gzip constants,
>   -1 = zlib default (currently 6)
>    0 = no compression
> 1..9 = various speed/size tradeoffs (9 is git default)

I think this makes a lot of sense, although IMHO I'd simply use 
Z_DEFAULT_COMPRESSION everywhere and be done with it, without extra 
complexity that isn't worth the size difference.


Nicolas


* Re: Compression speed for large files
  2006-07-03 14:33     ` Nicolas Pitre
@ 2006-07-03 14:54       ` Yakov Lerner
  2006-07-03 15:17         ` Johannes Schindelin
  2006-07-03 16:31       ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Yakov Lerner @ 2006-07-03 14:54 UTC (permalink / raw)
  Cc: git

On 7/3/06, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 3 Jul 2006, Joachim Berdal Haga wrote:
>
> > Alex Riesen wrote:
> > > On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
> > > > So: is it a good idea to change to faster compression, at least for larger
> > > > files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION
> > > > only for small files (perhaps <1MB?) and
> > > > Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
> > > > larger ones.
> > >
> > > Probably yes, as a per-repo config option.
> >
> > I can send a patch later. If it's to be a per-repo option, it's probably too
> > confusing with several values. Is it ok with
> >
> > core.compression = [-1..9]
> >
> > where the numbers are the zlib/gzip constants,
> >   -1 = zlib default (currently 6)
> >    0 = no compression
> > 1..9 = various speed/size tradeoffs (9 is git default)

It would be arguable whether, say, 10% better compression is worth
3-8x slower compression. But 3-4% better compression at the cost of
3-8x slower compression time, as the data suggest? I think this begs
for switching the default to Z_DEFAULT_COMPRESSION

Yakov


* Re: Compression speed for large files
  2006-07-03 14:54       ` Yakov Lerner
@ 2006-07-03 15:17         ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2006-07-03 15:17 UTC (permalink / raw)
  To: Yakov Lerner; +Cc: git

Hi,

On Mon, 3 Jul 2006, Yakov Lerner wrote:

> It would be arguable whether, say, 10% better compression is worth 
> 3-8x slower compression. But 3-4% better compression at the cost of 
> 3-8x slower compression time, as the data suggest? I think this begs for 
> switching the default to Z_DEFAULT_COMPRESSION

The real problem, of course, is that you cannot know, before you have tried, 
whether your data is really compressible or not.
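One common workaround for this is a cheap probe: compress a small prefix at the fastest level and look at the ratio. A sketch (the 64KB probe size and 0.9 cutoff are arbitrary choices for illustration, not anything git does):

```python
import os
import zlib

def looks_compressible(data, probe=64 * 1024, cutoff=0.9):
    """Guess compressibility from a prefix; thresholds are arbitrary."""
    sample = data[:probe]
    return len(zlib.compress(sample, zlib.Z_BEST_SPEED)) < cutoff * len(sample)

print(looks_compressible(b"some text " * 20000))  # repetitive text
print(looks_compressible(os.urandom(1 << 20)))    # random bytes
```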

Ciao,
Dscho


* Re: Compression speed for large files
  2006-07-03 14:33     ` Nicolas Pitre
  2006-07-03 14:54       ` Yakov Lerner
@ 2006-07-03 16:31       ` Linus Torvalds
  2006-07-03 18:59         ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
  2006-07-03 19:02         ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 16:31 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Joachim Berdal Haga, Alex Riesen, git



On Mon, 3 Jul 2006, Nicolas Pitre wrote:

> On Mon, 3 Jul 2006, Joachim Berdal Haga wrote:
> > 
> > I can send a patch later. If it's to be a per-repo option, it's probably too
> > confusing with several values. Is it ok with
> > 
> > core.compression = [-1..9]
> > 
> > where the numbers are the zlib/gzip constants,
> >   -1 = zlib default (currently 6)
> >    0 = no compression
> > 1..9 = various speed/size tradeoffs (9 is git default)
> 
> I think this makes a lot of sense, although IMHO I'd simply use 
> Z_DEFAULT_COMPRESSION everywhere and be done with it without extra 
> complexity which aren't worth the size difference.

I think Z_DEFAULT_COMPRESSION is fine too - we've long since started 
relying on pack-files and the delta compression for the _real_ size 
improvements, and as such, the zlib compression is less important.

That said, the "core.compression" thing sounds good to me, and gives 
people the ability to tune things for their loads.

		Linus


* [PATCH] Make zlib compression level configurable, and change default.
  2006-07-03 16:31       ` Linus Torvalds
@ 2006-07-03 18:59         ` Joachim B Haga
  2006-07-03 19:33           ` Linus Torvalds
  2006-07-03 19:02         ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
  1 sibling, 1 reply; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 18:59 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre, Linus Torvalds, Alex Riesen, Junio C Hamano

Make zlib compression level configurable, and change the default.

With the change in default, "git add ." on kernel dir is about
twice as fast as before, with only minimal (0.5%) change in
object size. The speed difference is even more noticeable
when committing large files, which is now up to 8 times faster.

The configurability is through setting core.compression = [-1..9]
which maps to the zlib constants; -1 is the default, 0 is no
compression, and 1..9 are various speed/size tradeoffs, 9
being slowest.

Signed-off-by: Joachim B Haga (cjhaga@fys.uio.no)
---
 Documentation/config.txt |    6 ++++++
 cache.h                  |    1 +
 config.c                 |    5 +++++
 environment.c            |    1 +
 sha1_file.c              |    4 ++--
 5 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index a04c5ad..ac89be7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -91,6 +91,12 @@ core.warnAmbiguousRefs::
        If true, git will warn you if the ref name you passed it is ambiguous
        and might match multiple refs in the .git/refs/ tree. True by default.
 
+core.compression::
+       An integer -1..9, indicating the compression level for objects that
+       are not in a pack file. -1 is the zlib and git default. 0 means no 
+       compression, and 1..9 are various speed/size tradeoffs, 9 being
+       slowest.
+
 alias.*::
        Command aliases for the gitlink:git[1] command wrapper - e.g.
        after defining "alias.last = cat-file commit HEAD", the invocation
diff --git a/cache.h b/cache.h
index 8719939..84770bf 100644
--- a/cache.h
+++ b/cache.h
@@ -183,6 +183,7 @@ extern int log_all_ref_updates;
 extern int warn_ambiguous_refs;
 extern int shared_repository;
 extern const char *apply_default_whitespace;
+extern int zlib_compression_level;
 
 #define GIT_REPO_VERSION 0
 extern int repository_format_version;
diff --git a/config.c b/config.c
index ec44827..61563be 100644
--- a/config.c
+++ b/config.c
@@ -279,6 +279,11 @@ int git_default_config(const char *var, 
                return 0;
        }
 
+       if (!strcmp(var, "core.compression")) {
+               zlib_compression_level = git_config_int(var, value);
+               return 0;
+       }
+
        if (!strcmp(var, "user.name")) {
                strlcpy(git_default_name, value, sizeof(git_default_name));
                return 0;
diff --git a/environment.c b/environment.c
index 3de8eb3..1d8ceef 100644
--- a/environment.c
+++ b/environment.c
@@ -20,6 +20,7 @@ int repository_format_version = 0;
 char git_commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";
 int shared_repository = PERM_UMASK;
 const char *apply_default_whitespace = NULL;
+int zlib_compression_level = -1;
 
 static char *git_dir, *git_object_dir, *git_index_file, *git_refs_dir,
        *git_graft_file;
diff --git a/sha1_file.c b/sha1_file.c
index 8179630..bc35808 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1458,7 +1458,7 @@ int write_sha1_file(void *buf, unsigned 
 
        /* Set it up */
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_BEST_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        size = deflateBound(&stream, len+hdrlen);
        compressed = xmalloc(size);
 
@@ -1511,7 +1511,7 @@ static void *repack_object(const unsigne
 
        /* Set it up */
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_BEST_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        size = deflateBound(&stream, len + hdrlen);
        buf = xmalloc(size);
 
-- 
1.4.1.g8fced-dirty


* [PATCH] Use configurable zlib compression level everywhere.
  2006-07-03 16:31       ` Linus Torvalds
  2006-07-03 18:59         ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
@ 2006-07-03 19:02         ` Joachim B Haga
  2006-07-03 19:43           ` Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 19:02 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre, Linus Torvalds, Alex Riesen, Junio C Hamano

This one I'm not so sure about; it's for completeness. But I don't actually use
git and haven't tested beyond the git add / git commit stage. Still...

Signed-off-by: Joachim B Haga (cjhaga@fys.uio.no)
---
 csum-file.c |    2 +-
 diff.c      |    2 +-
 http-push.c |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index ebaad03..6a7b40f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -122,7 +122,7 @@ int sha1write_compressed(struct sha1file
        void *out;
 
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_DEFAULT_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        maxsize = deflateBound(&stream, size);
        out = xmalloc(maxsize);
 
diff --git a/diff.c b/diff.c
index 5a71489..428ff78 100644
--- a/diff.c
+++ b/diff.c
@@ -583,7 +583,7 @@ static unsigned char *deflate_it(char *d
        z_stream stream;
 
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_BEST_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        bound = deflateBound(&stream, size);
        deflated = xmalloc(bound);
        stream.next_out = deflated;
diff --git a/http-push.c b/http-push.c
index e281f70..f761584 100644
--- a/http-push.c
+++ b/http-push.c
@@ -492,7 +492,7 @@ static void start_put(struct transfer_re
 
        /* Set it up */
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_BEST_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        size = deflateBound(&stream, len + hdrlen);
        request->buffer.buffer = xmalloc(size);
 
-- 
1.4.1.g8fced-dirty


* Re: [PATCH] Make zlib compression level configurable, and change default.
  2006-07-03 18:59         ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
@ 2006-07-03 19:33           ` Linus Torvalds
  2006-07-03 19:50             ` Linus Torvalds
  2006-07-03 20:11             ` Joachim B Haga
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 19:33 UTC (permalink / raw)
  To: Joachim B Haga
  Cc: Nicolas Pitre, Alex Riesen, Junio C Hamano, Git Mailing List



On Mon, 3 Jul 2006, Joachim B Haga wrote:
> 
> The configurability is through setting core.compression = [-1..9]
> which maps to the zlib constants; -1 is the default, 0 is no
> compression, and 1..9 are various speed/size tradeoffs, 9
> being slowest.

My only worry is that this encodes "Z_DEFAULT_COMPRESSION" as being -1, 
which happens to be /true/, but I don't think that's a documented 
interface (you're supposed to use the Z_DEFAULT_COMPRESSION macro, which 
could have any value, and just _happens_ to be -1).

Is it likely to ever change from that -1? Probably not. So I think your 
patch is technically correct, but it might just be nicer if it did 
something like

	..
	if (!strcmp(var, "core.compression")) {
		int level = git_config_int(var, value);
		if (level == -1)
			level = Z_DEFAULT_COMPRESSION;
		else if (level < 0 || level > Z_BEST_COMPRESSION)
			die("bad zlib compression level %d", level);
		zlib_compression_level = level;
		return 0;
	}
	..

which would be safer, and a smart compiler might notice that the -1 case 
ends up being a no-op, and then just generate code AS IF we just had a

	if (level < -1 || level > Z_BEST_COMPRESSION)
		die(...

there.

Oh, and for all the same reasons, we should use

	int zlib_compression_level = Z_BEST_COMPRESSION;

for the default initializer.

Hmm?

		Linus


* Re: [PATCH] Use configurable zlib compression level everywhere.
  2006-07-03 19:02         ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
@ 2006-07-03 19:43           ` Junio C Hamano
  2006-07-07 21:53             ` David Lang
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2006-07-03 19:43 UTC (permalink / raw)
  To: Joachim B Haga; +Cc: git

Joachim B Haga <cjhaga@fys.uio.no> writes:

> This one I'm not so sure about, it's for completeness. But I don't actually use
> git and haven't tested beyond the git add / git commit stage. Still...
>
> Signed-off-by: Joachim B Haga (cjhaga@fys.uio.no)

You made a good judgement to notice that these three are
different.

 * sha1write_compressed() in csum-file.c is for producing packs
   and most of the things we compress there are deltas and less
   compressible, so even when core.compression is set to high we
   might be better off using faster compression.

 * diff's deflate_it() is about producing binary diffs (later
   encoded in base85) for textual transfer.  Again it is almost
   always used to compress deltas, so the same comment as above
   applies to this.

 * http-push uses it to send compressed whole object, and this
   is only used over the network, so it is plausible that the
   user would want to use different compression level than the
   usual core.compression.

It is fine by me to use the same core.compression to these
three.  If somebody comes up with a workload that benefits from
having different settings for them, we can add separate
variables, falling back on the default core.compression if there
isn't one, as needed.
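The fallback scheme Junio describes (per-context variables defaulting to core.compression) can be sketched as a simple config lookup; the per-context variable names here are hypothetical, since only core.compression exists at this point:

```python
ZLIB_DEFAULT = -1  # Z_DEFAULT_COMPRESSION

def compression_for(config, context):
    """Per-context level, falling back to core.compression, then zlib default.

    'context' might be 'pack', 'diff' or 'httppush'; these per-context
    keys are hypothetical, not actual git configuration variables.
    """
    for key in ("%s.compression" % context, "core.compression"):
        if key in config:
            return config[key]
    return ZLIB_DEFAULT
```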

Thanks for the patches.


* Re: [PATCH] Make zlib compression level configurable, and change default.
  2006-07-03 19:33           ` Linus Torvalds
@ 2006-07-03 19:50             ` Linus Torvalds
  2006-07-03 20:11             ` Joachim B Haga
  1 sibling, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 19:50 UTC (permalink / raw)
  To: Joachim B Haga
  Cc: Nicolas Pitre, Alex Riesen, Junio C Hamano, Git Mailing List



On Mon, 3 Jul 2006, Linus Torvalds wrote:
> 
> Oh, and for all the same reasons, we should use
> 
> 	int zlib_compression_level = Z_BEST_COMPRESSION;

That should be Z_DEFAULT_COMPRESSION, of course.

Anyway, I think the patches are ok as-is, and my suggestion to avoid the 
"-1" and use Z_DEFAULT_COMPRESSION is really just an additional comment, 
not anything fundamental.

So Junio, feel free to add an

	Acked-by: Linus Torvalds <torvalds@osdl.org>

regardless of whether you also do that.

		Linus


* Re: [PATCH] Make zlib compression level configurable, and change default.
  2006-07-03 19:33           ` Linus Torvalds
  2006-07-03 19:50             ` Linus Torvalds
@ 2006-07-03 20:11             ` Joachim B Haga
  1 sibling, 0 replies; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 20:11 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Junio C Hamano

Linus Torvalds <torvalds@osdl.org> writes:

> [snip suggested improvements]

Yes, that would be more... thorough. And, especially for the

  int zlib_compression_level = Z_DEFAULT_COMPRESSION;

line, more self-explanatory, too. So here's an updated patch
(replacing the previous) including your suggestions.

-j.

-

Make zlib compression level configurable, and change default.

With the change in default, "git add ." on kernel dir is about
twice as fast as before, with only minimal (0.5%) change in
object size. The speed difference is even more noticeable
when committing large files, which is now up to 8 times faster.

The configurability is through setting core.compression = [-1..9]
which maps to the zlib constants; -1 is the default, 0 is no
compression, and 1..9 are various speed/size tradeoffs, 9
being slowest.

Signed-off-by: Joachim B Haga (cjhaga@fys.uio.no)
---
 Documentation/config.txt |    6 ++++++
 cache.h                  |    1 +
 config.c                 |   10 ++++++++++
 environment.c            |    1 +
 sha1_file.c              |    4 ++--
 5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index a04c5ad..ac89be7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -91,6 +91,12 @@ core.warnAmbiguousRefs::
        If true, git will warn you if the ref name you passed it is ambiguous
        and might match multiple refs in the .git/refs/ tree. True by default.
 
+core.compression::
+       An integer -1..9, indicating the compression level for objects that
+       are not in a pack file. -1 is the zlib and git default. 0 means no 
+       compression, and 1..9 are various speed/size tradeoffs, 9 being
+       slowest.
+
 alias.*::
        Command aliases for the gitlink:git[1] command wrapper - e.g.
        after defining "alias.last = cat-file commit HEAD", the invocation
diff --git a/cache.h b/cache.h
index 8719939..84770bf 100644
--- a/cache.h
+++ b/cache.h
@@ -183,6 +183,7 @@ extern int log_all_ref_updates;
 extern int warn_ambiguous_refs;
 extern int shared_repository;
 extern const char *apply_default_whitespace;
+extern int zlib_compression_level;
 
 #define GIT_REPO_VERSION 0
 extern int repository_format_version;
diff --git a/config.c b/config.c
index ec44827..b23f4bf 100644
--- a/config.c
+++ b/config.c
@@ -279,6 +279,16 @@ int git_default_config(const char *var, 
                return 0;
        }
 
+       if (!strcmp(var, "core.compression")) {
+               int level = git_config_int(var, value);
+               if (level == -1)
+                       level = Z_DEFAULT_COMPRESSION;
+               else if (level < 0 || level > Z_BEST_COMPRESSION)
+                       die("bad zlib compression level %d", level);
+               zlib_compression_level = level;
+               return 0;
+       }
+
        if (!strcmp(var, "user.name")) {
                strlcpy(git_default_name, value, sizeof(git_default_name));
                return 0;
diff --git a/environment.c b/environment.c
index 3de8eb3..43823ff 100644
--- a/environment.c
+++ b/environment.c
@@ -20,6 +20,7 @@ int repository_format_version = 0;
 char git_commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";
 int shared_repository = PERM_UMASK;
 const char *apply_default_whitespace = NULL;
+int zlib_compression_level = Z_DEFAULT_COMPRESSION;
 
 static char *git_dir, *git_object_dir, *git_index_file, *git_refs_dir,
        *git_graft_file;
diff --git a/sha1_file.c b/sha1_file.c
index 8179630..bc35808 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1458,7 +1458,7 @@ int write_sha1_file(void *buf, unsigned 
 
        /* Set it up */
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_BEST_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        size = deflateBound(&stream, len+hdrlen);
        compressed = xmalloc(size);
 
@@ -1511,7 +1511,7 @@ static void *repack_object(const unsigne
 
        /* Set it up */
        memset(&stream, 0, sizeof(stream));
-       deflateInit(&stream, Z_BEST_COMPRESSION);
+       deflateInit(&stream, zlib_compression_level);
        size = deflateBound(&stream, len + hdrlen);
        buf = xmalloc(size);
 
-- 
1.4.1.g8fced-dirty


* Re: Compression speed for large files
  2006-07-03 11:13 Compression speed for large files Joachim B Haga
  2006-07-03 12:03 ` Alex Riesen
@ 2006-07-03 21:45 ` Jeff King
  2006-07-03 22:25   ` Joachim Berdal Haga
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff King @ 2006-07-03 21:45 UTC (permalink / raw)
  To: Joachim B Haga; +Cc: git

On Mon, Jul 03, 2006 at 11:13:34AM +0000, Joachim B Haga wrote:

> often binary. In git, committing of large files is very slow; I have
> tested with a 45MB file, which takes about 1 minute to check in (on an
> intel core-duo 2GHz).

I know this has already been somewhat solved, but I found your numbers
curiously high. I work quite a bit with git and large files and I
haven't noticed this slowdown. Can you be more specific about your load?
Are you sure it is zlib?

On my 1.8Ghz Athlon, compressing 45MB of zeros into 20K takes about 2s.
Compressing 45MB of random data into a 45MB object takes 6.3s. In either
case, the commit takes only about 0.5s (since cogito stores the object
during the cg-add).

Is there some specific file pattern which is slow to compress? 
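The two extremes Jeff measures are easy to reproduce (sizes scaled down here, and exact timings will of course vary by machine):

```python
import os
import zlib

n = 1 << 20           # 1MB instead of 45MB; same qualitative behaviour
zeros = b"\0" * n
noise = os.urandom(n)

z_out = zlib.compress(zeros, 9)  # collapses to almost nothing
n_out = zlib.compress(noise, 9)  # incompressible: slight overhead instead

print(len(z_out), len(n_out))
```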

-Peff


* Re: Compression speed for large files
  2006-07-03 21:45 ` Compression speed for large files Jeff King
@ 2006-07-03 22:25   ` Joachim Berdal Haga
  2006-07-03 23:02     ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Joachim Berdal Haga @ 2006-07-03 22:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Joachim B Haga, git

Jeff King wrote:
> On Mon, Jul 03, 2006 at 11:13:34AM +0000, Joachim B Haga wrote:
> 
>> often binary. In git, committing of large files is very slow; I have
>> tested with a 45MB file, which takes about 1 minute to check in (on an
>> intel core-duo 2GHz).
> 
> I know this has already been somewhat solved, but I found your numbers
> curiously high. I work quite a bit with git and large files and I
> haven't noticed this slowdown. Can you be more specific about your load?
> Are you sure it is zlib?

Quite sure: at least to the extent that it is fixed by lowering the
compression level. But the wording was inexact: it's during object
creation, which happens at initial "git add" and then later during "git
commit".

But...

> On my 1.8Ghz Athlon, compressing 45MB of zeros into 20K takes about 2s.
> Compressing 45MB of random data into a 45MB object takes 6.3s. In either
> case, the commit takes only about 0.5s (since cogito stores the object
> during the cg-add).
> 
> Is there some specific file pattern which is slow to compress? 

Yes, it seems so. At least the effect is much more pronounced for my
files than for random/null data. In this context, "my" files are generated
data files, binary or ASCII.

Here's a test with "time gzip -[169] -c file >/dev/null". Random data
from /dev/urandom, kernel headers are concatenation of *.h in kernel
sources. All times in seconds, on my puny home computer (1GHz Via Nehemiah)

       random (23MB)  data (23MB)   headers (44MB)
-9     10.2           72.5          38.5
-6     10.2           13.5          12.9
-1      9.9            4.1           7.0

So... data dependent, yes. But it hits even for normal source code.

(Btw; the default (-6) seems to be less data dependent than the other
values. Maybe that's on purpose.)

If you want to look at a highly-variable dataset (the one above), try
http://lupus.ig3.net/SIMULATION.dx.gz (5MB, slow server), but that's just
an example, I see the same variability for example also on binary data files.

-j.


* Re: Compression speed for large files
  2006-07-03 22:25   ` Joachim Berdal Haga
@ 2006-07-03 23:02     ` Linus Torvalds
  2006-07-04  5:42       ` Joachim Berdal Haga
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 23:02 UTC (permalink / raw)
  To: Joachim Berdal Haga; +Cc: Jeff King, Joachim B Haga, git



On Tue, 4 Jul 2006, Joachim Berdal Haga wrote:
> 
> Here's a test with "time gzip -[169] -c file >/dev/null". Random data
> from /dev/urandom, kernel headers are concatenation of *.h in kernel
> sources. All times in seconds, on my puny home computer (1GHz Via Nehemiah)

That "Via Nehemiah" is probably a big part of it.

I think the VIA Nehemiah just has a 64kB L2 cache, and I bet performance 
plummets if the tables end up being used past that. 

And I think a large part of the higher compressions is that they allow the 
compression window and tables to grow bigger.

		Linus
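[Editor's note: Linus's point about window and table sizes can be illustrated with Python's zlib binding, whose `compressobj()` exposes the same `windowBits`/`memLevel` knobs as zlib's `deflateInit2()`; zlib's manual puts deflate's memory use at roughly `(1 << (windowBits+2)) + (1 << (memLevel+9))` bytes. The sketch below shows the ratio cost of shrinking those structures:]

```python
import os
import zlib

def deflate(data, wbits, memlevel):
    # Same knobs as zlib's deflateInit2(); compression level 9 in both cases.
    co = zlib.compressobj(9, zlib.DEFLATED, wbits, memlevel)
    return co.compress(data) + co.flush()

# 4 KB of random bytes repeated 50 times: every match sits 4096 bytes back.
chunk = os.urandom(4096)
data = chunk * 50

full  = deflate(data, wbits=15, memlevel=8)  # 32 KB window (the default)
small = deflate(data, wbits=9,  memlevel=1)  # 512-byte window, tiny tables

print(len(data), len(full), len(small))
```

With the full 32 KB window the repeats at distance 4096 are found and the output collapses; with a 512-byte window they are out of reach and the output stays near the input size. The larger window buys ratio at the cost of exactly the cache footprint described above.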


* Re: Compression speed for large files
  2006-07-03 23:02     ` Linus Torvalds
@ 2006-07-04  5:42       ` Joachim Berdal Haga
  0 siblings, 0 replies; 21+ messages in thread
From: Joachim Berdal Haga @ 2006-07-04  5:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, git

Linus Torvalds wrote:
> 
> On Tue, 4 Jul 2006, Joachim Berdal Haga wrote:
>> Here's a test with "time gzip -[169] -c file >/dev/null". Random data
>> from /dev/urandom, kernel headers are concatenation of *.h in kernel
>> sources. All times in seconds, on my puny home computer (1GHz Via Nehemiah)
> 
> That "Via Nehemiah" is probably a big part of it.
> 
> I think the VIA Nehemiah just has a 64kB L2 cache, and I bet performance 
> plummets if the tables end up being used past that. 

Not really. The numbers in my original post were from an Intel core-duo;
they were 158/18/6 s for comparable (but larger) data.

And on a P4 1.8GHz with 512kB L2, the same 23MB data file compresses in
28.1/5.9/1.3 s. That's a factor of 22 between slowest and fastest; on the
VIA it was only a factor of 18, so the difference is actually *larger*.

-j.


* Re: [PATCH] Use configurable zlib compression level everywhere.
  2006-07-03 19:43           ` Junio C Hamano
@ 2006-07-07 21:53             ` David Lang
  2006-07-08  2:10               ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: David Lang @ 2006-07-07 21:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Joachim B Haga, git

On Mon, 3 Jul 2006, Junio C Hamano wrote:

> * sha1write_compressed() in csum-file.c is for producing packs
>   and most of the things we compress there are deltas and less
>   compressible, so even when core.compression is set to high we
>   might be better off using faster compression.

Why would deltas have poor compression? I'd expect them to have about the same 
compression as the files they are deltas of (or slightly better, due to the fact 
that the delta metainfo is highly repetitive).

David Lang


* Re: [PATCH] Use configurable zlib compression level everywhere.
  2006-07-07 21:53             ` David Lang
@ 2006-07-08  2:10               ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2006-07-08  2:10 UTC (permalink / raw)
  To: David Lang; +Cc: Junio C Hamano, Joachim B Haga, git

Hi,

On Fri, 7 Jul 2006, David Lang wrote:

> On Mon, 3 Jul 2006, Junio C Hamano wrote:
> 
> > * sha1write_compressed() in csum-file.c is for producing packs
> >   and most of the things we compress there are deltas and less
> >   compressible, so even when core.compression is set to high we
> >   might be better off using faster compression.
> 
> Why would deltas have poor compression? I'd expect them to have about the same
> compression as the files they are deltas of (or slightly better, due to the fact
> that the delta metainfo is highly repetitive).

Deltas should compress poorly almost by definition, because compression 
works by encoding more efficiently those parts of a file which do not 
carry much information (think entropy).

If you have deltas which really make sense, they are almost _pure_ 
information, i.e. they do not contain much redundancy compared to real 
files. So the compressor (which does not know anything about the 
particular characteristics of deltas) cannot take much redundancy out 
of the delta. Therefore the entropy is very high, and the compression 
ratio is low.

Hope this makes sense to you,
Dscho


end of thread, other threads:[~2006-07-08  2:10 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-03 11:13 Compression speed for large files Joachim B Haga
2006-07-03 12:03 ` Alex Riesen
2006-07-03 12:42   ` Elrond
2006-07-03 13:44     ` Joachim B Haga
2006-07-03 13:32   ` Joachim Berdal Haga
     [not found]     ` <Pine.LNX.4.64.0607031030150.1213@localhost.localdomain>
2006-07-03 14:33     ` Nicolas Pitre
2006-07-03 14:54       ` Yakov Lerner
2006-07-03 15:17         ` Johannes Schindelin
2006-07-03 16:31       ` Linus Torvalds
2006-07-03 18:59         ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
2006-07-03 19:33           ` Linus Torvalds
2006-07-03 19:50             ` Linus Torvalds
2006-07-03 20:11             ` Joachim B Haga
2006-07-03 19:02         ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
2006-07-03 19:43           ` Junio C Hamano
2006-07-07 21:53             ` David Lang
2006-07-08  2:10               ` Johannes Schindelin
2006-07-03 21:45 ` Compression speed for large files Jeff King
2006-07-03 22:25   ` Joachim Berdal Haga
2006-07-03 23:02     ` Linus Torvalds
2006-07-04  5:42       ` Joachim Berdal Haga
