* Compression speed for large files
@ 2006-07-03 11:13 Joachim B Haga
2006-07-03 12:03 ` Alex Riesen
2006-07-03 21:45 ` Compression speed for large files Jeff King
0 siblings, 2 replies; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 11:13 UTC (permalink / raw)
To: git
I'm looking at doing version control of data files, potentially very large,
often binary. In git, committing large files is very slow; I have tested with
a 45MB file, which takes about 1 minute to check in (on an Intel core-duo 2GHz).
Now, most of the time is spent compressing the file. Would it be a good idea
to change the Z_BEST_COMPRESSION flag passed to zlib, at least for large files?
I have measured the time spent by git-commit with different flags in sha1_file.c:
method                  time (s)   object size (kB)
Z_BEST_COMPRESSION        62.0          17136
Z_DEFAULT_COMPRESSION     10.4          16536
Z_BEST_SPEED               4.8          17071
In this case Z_BEST_COMPRESSION also compresses worse, but that's not the
major issue: the time is. Here are a couple of other data points, measured
with gzip -9, -6 and -1 (comparable to the Z_ flags above):
129MB ascii data file
method     time (s)   object size (kB)
gzip -9      158          23066
gzip -6       18          23619
gzip -1        6          32304

3MB ascii data file
gzip -9      2.2            887
gzip -6      0.7            912
gzip -1      0.3           1134
So: is it a good idea to change to faster compression, at least for larger
files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION only
for small files (perhaps <1MB?) and Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
larger ones.
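(For anyone who wants to reproduce this outside of git: a minimal standalone
harness along the following lines gives comparable numbers. It is only a
sketch using plain zlib, not the actual git code path; build with
"cc -o ztime ztime.c -lz" and run it on any large file.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <zlib.h>

    /* Deflate buf once at the given level; report time and output size. */
    static void time_level(unsigned char *buf, unsigned long len, int level)
    {
        z_stream s;
        unsigned long bound;
        unsigned char *out;
        clock_t t0 = clock();

        memset(&s, 0, sizeof(s));
        deflateInit(&s, level);
        bound = deflateBound(&s, len);  /* worst-case output size */
        out = malloc(bound);

        s.next_in = buf;
        s.avail_in = len;
        s.next_out = out;
        s.avail_out = bound;
        deflate(&s, Z_FINISH);          /* one call is enough: out fits */
        printf("level %2d: %6.2f s, %lu bytes\n", level,
               (double)(clock() - t0) / CLOCKS_PER_SEC, s.total_out);

        deflateEnd(&s);
        free(out);
    }

    int main(int argc, char **argv)
    {
        static unsigned char buf[64 << 20];  /* up to 64MB of input */
        FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;
        unsigned long len;
        int i, levels[] = { Z_BEST_SPEED, Z_DEFAULT_COMPRESSION,
                            Z_BEST_COMPRESSION };

        if (!f)
            return 1;
        len = fread(buf, 1, sizeof(buf), f);
        for (i = 0; i < 3; i++)
            time_level(buf, len, levels[i]);
        return 0;
    }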
-j.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 11:13 Compression speed for large files Joachim B Haga
@ 2006-07-03 12:03 ` Alex Riesen
2006-07-03 12:42 ` Elrond
2006-07-03 13:32 ` Joachim Berdal Haga
2006-07-03 21:45 ` Compression speed for large files Jeff King
1 sibling, 2 replies; 21+ messages in thread
From: Alex Riesen @ 2006-07-03 12:03 UTC (permalink / raw)
To: Joachim B Haga; +Cc: git
On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
> So: is it a good idea to change to faster compression, at least for larger
> files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION only
> for small files (perhaps <1MB?) and Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
> larger ones.
Probably yes, as a per-repo config option.
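(Settable per repository with something like

    git repo-config core.compression 1

once such an option exists; the command is just a sketch of the idea.)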
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 12:03 ` Alex Riesen
@ 2006-07-03 12:42 ` Elrond
2006-07-03 13:44 ` Joachim B Haga
2006-07-03 13:32 ` Joachim Berdal Haga
1 sibling, 1 reply; 21+ messages in thread
From: Elrond @ 2006-07-03 12:42 UTC (permalink / raw)
To: git
Joachim B Haga <cjhaga@fys.uio.no> writes:
[...]
> method                  time (s)   object size (kB)
> Z_BEST_COMPRESSION        62.0          17136
> Z_DEFAULT_COMPRESSION     10.4          16536
> Z_BEST_SPEED               4.8          17071
>
> In this case Z_BEST_COMPRESSION also compresses worse,
[...]
I personally find that very interesting. Is this a known "issue" with zlib?
It suggests that, with different options, it's possible to create smaller
repositories despite the compression 'advertised' (by zlib, not git) as "best".
Alex Riesen <raa.lkml@gmail.com> writes:
[...]
> Probably yes, as a per-repo config option.
The option should probably be the size at which to start using
"default" compression.
Elrond
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 12:03 ` Alex Riesen
2006-07-03 12:42 ` Elrond
@ 2006-07-03 13:32 ` Joachim Berdal Haga
[not found] ` <Pine.LNX.4.64.0607031030150.1213@localhost.localdomain>
2006-07-03 14:33 ` Nicolas Pitre
1 sibling, 2 replies; 21+ messages in thread
From: Joachim Berdal Haga @ 2006-07-03 13:32 UTC (permalink / raw)
To: Alex Riesen; +Cc: git
Alex Riesen wrote:
> On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
>> So: is it a good idea to change to faster compression, at least for
>> larger files? From my (limited) testing I would suggest using
>> Z_BEST_COMPRESSION only for small files (perhaps <1MB?) and
>> Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
>> larger ones.
>
> Probably yes, as a per-repo config option.
I can send a patch later. If it's to be a per-repo option, it's probably
too confusing with several values. Is it ok with
core.compression = [-1..9]
where the numbers are the zlib/gzip constants,
-1 = zlib default (currently 6)
0 = no compression
1..9 = various speed/size tradeoffs (9 is git default)
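A repository full of big generated files could then simply carry, e.g.

    [core]
            compression = 1

in its .git/config; this is only a sketch of the proposed option, reusing the
syntax of the existing [core] settings.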
Btw, I just tested the kernel sources (gzip only, but files compressed
individually):
time find . -type f | xargs gzip -9 -c | wc -c
I found the space saving from -6 to -9 to be under 0.6%, at double the
CPU time. So perhaps Z_DEFAULT_COMPRESSION would be a good default.
-j
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 12:42 ` Elrond
@ 2006-07-03 13:44 ` Joachim B Haga
0 siblings, 0 replies; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 13:44 UTC (permalink / raw)
To: git
Elrond <elrond+kernel.org@samba-tng.org> writes:
>
> Joachim B Haga <cjhaga@fys.uio.no> writes:
> [...]
> > In this case Z_BEST_COMPRESSION also compresses worse,
> [...]
>
> I personally find that very interesting. Is this a known "issue" with zlib?
> It suggests that, with different options, it's possible to create smaller
> repositories despite the compression 'advertised' (by zlib, not git) as "best".
There are also other tunables in zlib, such as the balance between Huffman
coding (good for data files) and string matching (good for text files). So with
more knowledge of the data it should be possible to compress even better. I'm
not advocating tuning this in git though ;)
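(For reference, these knobs live behind the strategy argument to
deflateInit2(); just a sketch of what such tuning would look like, not
something I'm proposing for git:

    z_stream s;
    memset(&s, 0, sizeof(s));
    /* The last argument picks the strategy: Z_DEFAULT_STRATEGY,
     * Z_FILTERED (lean towards Huffman coding), or Z_HUFFMAN_ONLY
     * (no string matching at all). */
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                 15 /* window bits */, 8 /* memory level */, Z_FILTERED);

in place of the plain deflateInit() call.)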
>
> Alex Riesen <raa.lkml@gmail.com> writes:
> [...]
> > Probably yes, as a per-repo config option.
>
> The option should probably be the size at which to start using
> "default" compression.
That is possible, too. I'm open to any decision or consensus, as long as I get
my commits in less than 10s :)
-j.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 13:32 ` Joachim Berdal Haga
[not found] ` <Pine.LNX.4.64.0607031030150.1213@localhost.localdomain>
@ 2006-07-03 14:33 ` Nicolas Pitre
2006-07-03 14:54 ` Yakov Lerner
2006-07-03 16:31 ` Linus Torvalds
1 sibling, 2 replies; 21+ messages in thread
From: Nicolas Pitre @ 2006-07-03 14:33 UTC (permalink / raw)
To: Joachim Berdal Haga; +Cc: Alex Riesen, git
On Mon, 3 Jul 2006, Joachim Berdal Haga wrote:
> Alex Riesen wrote:
> > On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
> > > So: is it a good idea to change to faster compression, at least for larger
> > > files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION
> > > only for small files (perhaps <1MB?) and
> > > Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
> > > larger ones.
> >
> > Probably yes, as a per-repo config option.
>
> I can send a patch later. If it's to be a per-repo option, it's probably too
> confusing with several values. Is it ok with
>
> core.compression = [-1..9]
>
> where the numbers are the zlib/gzip constants,
> -1 = zlib default (currently 6)
> 0 = no compression
> 1..9 = various speed/size tradeoffs (9 is git default)
I think this makes a lot of sense, although IMHO I'd simply use
Z_DEFAULT_COMPRESSION everywhere and be done with it, without extra
complexity that isn't worth the size difference.
Nicolas
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 14:33 ` Nicolas Pitre
@ 2006-07-03 14:54 ` Yakov Lerner
2006-07-03 15:17 ` Johannes Schindelin
2006-07-03 16:31 ` Linus Torvalds
1 sibling, 1 reply; 21+ messages in thread
From: Yakov Lerner @ 2006-07-03 14:54 UTC (permalink / raw)
Cc: git
On 7/3/06, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 3 Jul 2006, Joachim Berdal Haga wrote:
>
> > Alex Riesen wrote:
> > > On 7/3/06, Joachim B Haga <cjhaga@fys.uio.no> wrote:
> > > > So: is it a good idea to change to faster compression, at least for larger
> > > > files? From my (limited) testing I would suggest using Z_BEST_COMPRESSION
> > > > only for small files (perhaps <1MB?) and
> > > > Z_DEFAULT_COMPRESSION/Z_BEST_SPEED for
> > > > larger ones.
> > >
> > > Probably yes, as a per-repo config option.
> >
> > I can send a patch later. If it's to be a per-repo option, it's probably too
> > confusing with several values. Is it ok with
> >
> > core.compression = [-1..9]
> >
> > where the numbers are the zlib/gzip constants,
> > -1 = zlib default (currently 6)
> > 0 = no compression
> > 1..9 = various speed/size tradeoffs (9 is git default)
It would be arguable whether, say, 10% better compression is worth
3-8x slower compression. But 3-4% better compression at the cost of
3-8x slower compression time, as the data suggest? I think this begs
for switching the default to Z_DEFAULT_COMPRESSION.
Yakov
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 14:54 ` Yakov Lerner
@ 2006-07-03 15:17 ` Johannes Schindelin
0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2006-07-03 15:17 UTC (permalink / raw)
To: Yakov Lerner; +Cc: git
Hi,
On Mon, 3 Jul 2006, Yakov Lerner wrote:
> It would be arguable whether, say, 10% better compression is worth
> 3-8x slower compression. But 3-4% better compression at the cost of
> 3-8x slower compression time, as the data suggest? I think this begs for
> switching the default to Z_DEFAULT_COMPRESSION.
The real problem, of course, is that you cannot know, before you have tried,
whether your data is really compressible or not.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 14:33 ` Nicolas Pitre
2006-07-03 14:54 ` Yakov Lerner
@ 2006-07-03 16:31 ` Linus Torvalds
2006-07-03 18:59 ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
2006-07-03 19:02 ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 16:31 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Joachim Berdal Haga, Alex Riesen, git
On Mon, 3 Jul 2006, Nicolas Pitre wrote:
> On Mon, 3 Jul 2006, Joachim Berdal Haga wrote:
> >
> > I can send a patch later. If it's to be a per-repo option, it's probably too
> > confusing with several values. Is it ok with
> >
> > core.compression = [-1..9]
> >
> > where the numbers are the zlib/gzip constants,
> > -1 = zlib default (currently 6)
> > 0 = no compression
> > 1..9 = various speed/size tradeoffs (9 is git default)
>
> I think this makes a lot of sense, although IMHO I'd simply use
> Z_DEFAULT_COMPRESSION everywhere and be done with it without extra
> complexity which aren't worth the size difference.
I think Z_DEFAULT_COMPRESSION is fine too - we've long since started
relying on pack-files and the delta compression for the _real_ size
improvements, and as such, the zlib compression is less important.
That said, the "core.compression" thing sounds good to me, and gives
people the ability to tune things for their loads.
Linus
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] Make zlib compression level configurable, and change default.
2006-07-03 16:31 ` Linus Torvalds
@ 2006-07-03 18:59 ` Joachim B Haga
2006-07-03 19:33 ` Linus Torvalds
2006-07-03 19:02 ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
1 sibling, 1 reply; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 18:59 UTC (permalink / raw)
To: git; +Cc: Nicolas Pitre, Linus Torvalds, Alex Riesen, Junio C Hamano
Make zlib compression level configurable, and change the default.
With the change in default, "git add ." on kernel dir is about
twice as fast as before, with only minimal (0.5%) change in
object size. The speed difference is even more noticeable
when committing large files, which is now up to 8 times faster.
The configurability is through setting core.compression = [-1..9]
which maps to the zlib constants; -1 is the default, 0 is no
compression, and 1..9 are various speed/size tradeoffs, 9
being slowest.
Signed-off-by: Joachim B Haga <cjhaga@fys.uio.no>
---
Documentation/config.txt | 6 ++++++
cache.h | 1 +
config.c | 5 +++++
environment.c | 1 +
sha1_file.c | 4 ++--
5 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index a04c5ad..ac89be7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -91,6 +91,12 @@ core.warnAmbiguousRefs::
If true, git will warn you if the ref name you passed it is ambiguous
and might match multiple refs in the .git/refs/ tree. True by default.
+core.compression::
+ An integer -1..9, indicating the compression level for objects that
+ are not in a pack file. -1 is the zlib and git default. 0 means no
+ compression, and 1..9 are various speed/size tradeoffs, 9 being
+ slowest.
+
alias.*::
Command aliases for the gitlink:git[1] command wrapper - e.g.
after defining "alias.last = cat-file commit HEAD", the invocation
diff --git a/cache.h b/cache.h
index 8719939..84770bf 100644
--- a/cache.h
+++ b/cache.h
@@ -183,6 +183,7 @@ extern int log_all_ref_updates;
extern int warn_ambiguous_refs;
extern int shared_repository;
extern const char *apply_default_whitespace;
+extern int zlib_compression_level;
#define GIT_REPO_VERSION 0
extern int repository_format_version;
diff --git a/config.c b/config.c
index ec44827..61563be 100644
--- a/config.c
+++ b/config.c
@@ -279,6 +279,11 @@ int git_default_config(const char *var,
return 0;
}
+ if (!strcmp(var, "core.compression")) {
+ zlib_compression_level = git_config_int(var, value);
+ return 0;
+ }
+
if (!strcmp(var, "user.name")) {
strlcpy(git_default_name, value, sizeof(git_default_name));
return 0;
diff --git a/environment.c b/environment.c
index 3de8eb3..1d8ceef 100644
--- a/environment.c
+++ b/environment.c
@@ -20,6 +20,7 @@ int repository_format_version = 0;
char git_commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";
int shared_repository = PERM_UMASK;
const char *apply_default_whitespace = NULL;
+int zlib_compression_level = -1;
static char *git_dir, *git_object_dir, *git_index_file, *git_refs_dir,
*git_graft_file;
diff --git a/sha1_file.c b/sha1_file.c
index 8179630..bc35808 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1458,7 +1458,7 @@ int write_sha1_file(void *buf, unsigned
/* Set it up */
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_BEST_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
size = deflateBound(&stream, len+hdrlen);
compressed = xmalloc(size);
@@ -1511,7 +1511,7 @@ static void *repack_object(const unsigne
/* Set it up */
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_BEST_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
size = deflateBound(&stream, len + hdrlen);
buf = xmalloc(size);
--
1.4.1.g8fced-dirty
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH] Use configurable zlib compression level everywhere.
2006-07-03 16:31 ` Linus Torvalds
2006-07-03 18:59 ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
@ 2006-07-03 19:02 ` Joachim B Haga
2006-07-03 19:43 ` Junio C Hamano
1 sibling, 1 reply; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 19:02 UTC (permalink / raw)
To: git; +Cc: Nicolas Pitre, Linus Torvalds, Alex Riesen, Junio C Hamano
This one I'm not so sure about; it's for completeness. But I don't actually use
git and haven't tested beyond the git add / git commit stage. Still...
Signed-off-by: Joachim B Haga <cjhaga@fys.uio.no>
---
csum-file.c | 2 +-
diff.c | 2 +-
http-push.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/csum-file.c b/csum-file.c
index ebaad03..6a7b40f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -122,7 +122,7 @@ int sha1write_compressed(struct sha1file
void *out;
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_DEFAULT_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
maxsize = deflateBound(&stream, size);
out = xmalloc(maxsize);
diff --git a/diff.c b/diff.c
index 5a71489..428ff78 100644
--- a/diff.c
+++ b/diff.c
@@ -583,7 +583,7 @@ static unsigned char *deflate_it(char *d
z_stream stream;
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_BEST_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
bound = deflateBound(&stream, size);
deflated = xmalloc(bound);
stream.next_out = deflated;
diff --git a/http-push.c b/http-push.c
index e281f70..f761584 100644
--- a/http-push.c
+++ b/http-push.c
@@ -492,7 +492,7 @@ static void start_put(struct transfer_re
/* Set it up */
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_BEST_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
size = deflateBound(&stream, len + hdrlen);
request->buffer.buffer = xmalloc(size);
--
1.4.1.g8fced-dirty
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] Make zlib compression level configurable, and change default.
2006-07-03 18:59 ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
@ 2006-07-03 19:33 ` Linus Torvalds
2006-07-03 19:50 ` Linus Torvalds
2006-07-03 20:11 ` Joachim B Haga
0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 19:33 UTC (permalink / raw)
To: Joachim B Haga
Cc: Nicolas Pitre, Alex Riesen, Junio C Hamano, Git Mailing List
On Mon, 3 Jul 2006, Joachim B Haga wrote:
>
> The configurability is through setting core.compression = [-1..9]
> which maps to the zlib constants; -1 is the default, 0 is no
> compression, and 1..9 are various speed/size tradeoffs, 9
> being slowest.
My only worry is that this encodes "Z_DEFAULT_COMPRESSION" as being -1,
which happens to be /true/, but I don't think that's a documented
interface (you're supposed to use the Z_DEFAULT_COMPRESSION macro, which
could have any value, and just _happens_ to be -1).
Is it likely to ever change from that -1? Probably not. So I think your
patch is technically correct, but it might just be nicer if it did
something like
..
	if (!strcmp(var, "core.compression")) {
		int level = git_config_int(var, value);
		if (level == -1)
			level = Z_DEFAULT_COMPRESSION;
		else if (level < 0 || level > Z_BEST_COMPRESSION)
			die("bad zlib compression level %d", level);
		zlib_compression_level = level;
		return 0;
	}
..
which would be safer, and a smart compiler might notice that the -1 case
ends up being a no-op, and then just generate code AS IF we just had a
	if (level < -1 || level > Z_BEST_COMPRESSION)
		die(...
there.
Oh, and for all the same reasons, we should use
	int zlib_compression_level = Z_BEST_COMPRESSION;
for the default initializer.
Hmm?
Linus
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] Use configurable zlib compression level everywhere.
2006-07-03 19:02 ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
@ 2006-07-03 19:43 ` Junio C Hamano
2006-07-07 21:53 ` David Lang
0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2006-07-03 19:43 UTC (permalink / raw)
To: Joachim B Haga; +Cc: git
Joachim B Haga <cjhaga@fys.uio.no> writes:
> This one I'm not so sure about; it's for completeness. But I don't actually use
> git and haven't tested beyond the git add / git commit stage. Still...
>
> Signed-off-by: Joachim B Haga <cjhaga@fys.uio.no>
You made a good judgement to notice that these three are
different.
* sha1write_compressed() in csum-file.c is for producing packs
  and most of the things we compress there are deltas and less
  compressible, so even when core.compression is set to high we
  might be better off using faster compression.

* diff's deflate_it() is about producing binary diffs (later
  encoded in base85) for textual transfer. Again it is almost
  always used to compress deltas, so the same comment as above
  applies to this.

* http-push uses it to send whole compressed objects, and this
  is only used over the network, so it is plausible that the
  user would want a different compression level than the
  usual core.compression.

It is fine by me to use the same core.compression for these
three. If somebody comes up with a workload that benefits from
having different settings for them, we can add separate
variables, falling back on the default core.compression if there
isn't one, as needed.
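Purely for illustration, with hypothetical section names that nothing
parses today, such a setup might one day read:

	[core]
		compression = 9	; loose objects
	[pack]
		compression = 1	; pack deltas, falling back to core.compression

but again, only if a real workload calls for it.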
Thanks for the patches.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] Make zlib compression level configurable, and change default.
2006-07-03 19:33 ` Linus Torvalds
@ 2006-07-03 19:50 ` Linus Torvalds
2006-07-03 20:11 ` Joachim B Haga
1 sibling, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 19:50 UTC (permalink / raw)
To: Joachim B Haga
Cc: Nicolas Pitre, Alex Riesen, Junio C Hamano, Git Mailing List
On Mon, 3 Jul 2006, Linus Torvalds wrote:
>
> Oh, and for all the same reasons, we should use
>
> 	int zlib_compression_level = Z_BEST_COMPRESSION;
That should be Z_DEFAULT_COMPRESSION, of course.
Anyway, I think the patches are ok as-is, and my suggestion to avoid the
"-1" and use Z_DEFAULT_COMPRESSION is really just an additional comment,
not anything fundamental.
So Junio, feel free to add an
Acked-by: Linus Torvalds <torvalds@osdl.org>
regardless of whether you also do that.
Linus
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] Make zlib compression level configurable, and change default.
2006-07-03 19:33 ` Linus Torvalds
2006-07-03 19:50 ` Linus Torvalds
@ 2006-07-03 20:11 ` Joachim B Haga
1 sibling, 0 replies; 21+ messages in thread
From: Joachim B Haga @ 2006-07-03 20:11 UTC (permalink / raw)
To: git; +Cc: Linus Torvalds, Junio C Hamano
Linus Torvalds <torvalds@osdl.org> writes:
> [snip suggested improvements]
Yes, that would be more... thorough. And, especially for the

	int zlib_compression_level = Z_DEFAULT_COMPRESSION;

line, more self-explanatory, too. So here's an updated patch
(replacing the previous) including your suggestions.
-j.
-
Make zlib compression level configurable, and change default.
With the change in default, "git add ." on kernel dir is about
twice as fast as before, with only minimal (0.5%) change in
object size. The speed difference is even more noticeable
when committing large files, which is now up to 8 times faster.
The configurability is through setting core.compression = [-1..9]
which maps to the zlib constants; -1 is the default, 0 is no
compression, and 1..9 are various speed/size tradeoffs, 9
being slowest.
Signed-off-by: Joachim B Haga <cjhaga@fys.uio.no>
---
Documentation/config.txt | 6 ++++++
cache.h | 1 +
config.c | 10 ++++++++++
environment.c | 1 +
sha1_file.c | 4 ++--
5 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index a04c5ad..ac89be7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -91,6 +91,12 @@ core.warnAmbiguousRefs::
If true, git will warn you if the ref name you passed it is ambiguous
and might match multiple refs in the .git/refs/ tree. True by default.
+core.compression::
+ An integer -1..9, indicating the compression level for objects that
+ are not in a pack file. -1 is the zlib and git default. 0 means no
+ compression, and 1..9 are various speed/size tradeoffs, 9 being
+ slowest.
+
alias.*::
Command aliases for the gitlink:git[1] command wrapper - e.g.
after defining "alias.last = cat-file commit HEAD", the invocation
diff --git a/cache.h b/cache.h
index 8719939..84770bf 100644
--- a/cache.h
+++ b/cache.h
@@ -183,6 +183,7 @@ extern int log_all_ref_updates;
extern int warn_ambiguous_refs;
extern int shared_repository;
extern const char *apply_default_whitespace;
+extern int zlib_compression_level;
#define GIT_REPO_VERSION 0
extern int repository_format_version;
diff --git a/config.c b/config.c
index ec44827..b23f4bf 100644
--- a/config.c
+++ b/config.c
@@ -279,6 +279,16 @@ int git_default_config(const char *var,
return 0;
}
+ if (!strcmp(var, "core.compression")) {
+ int level = git_config_int(var, value);
+ if (level == -1)
+ level = Z_DEFAULT_COMPRESSION;
+ else if (level < 0 || level > Z_BEST_COMPRESSION)
+ die("bad zlib compression level %d", level);
+ zlib_compression_level = level;
+ return 0;
+ }
+
if (!strcmp(var, "user.name")) {
strlcpy(git_default_name, value, sizeof(git_default_name));
return 0;
diff --git a/environment.c b/environment.c
index 3de8eb3..43823ff 100644
--- a/environment.c
+++ b/environment.c
@@ -20,6 +20,7 @@ int repository_format_version = 0;
char git_commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";
int shared_repository = PERM_UMASK;
const char *apply_default_whitespace = NULL;
+int zlib_compression_level = Z_DEFAULT_COMPRESSION;
static char *git_dir, *git_object_dir, *git_index_file, *git_refs_dir,
*git_graft_file;
diff --git a/sha1_file.c b/sha1_file.c
index 8179630..bc35808 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1458,7 +1458,7 @@ int write_sha1_file(void *buf, unsigned
/* Set it up */
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_BEST_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
size = deflateBound(&stream, len+hdrlen);
compressed = xmalloc(size);
@@ -1511,7 +1511,7 @@ static void *repack_object(const unsigne
/* Set it up */
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_BEST_COMPRESSION);
+ deflateInit(&stream, zlib_compression_level);
size = deflateBound(&stream, len + hdrlen);
buf = xmalloc(size);
--
1.4.1.g8fced-dirty
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 11:13 Compression speed for large files Joachim B Haga
2006-07-03 12:03 ` Alex Riesen
@ 2006-07-03 21:45 ` Jeff King
2006-07-03 22:25 ` Joachim Berdal Haga
1 sibling, 1 reply; 21+ messages in thread
From: Jeff King @ 2006-07-03 21:45 UTC (permalink / raw)
To: Joachim B Haga; +Cc: git
On Mon, Jul 03, 2006 at 11:13:34AM +0000, Joachim B Haga wrote:
> often binary. In git, committing of large files is very slow; I have
> tested with a 45MB file, which takes about 1 minute to check in (on an
> intel core-duo 2GHz).
I know this has already been somewhat solved, but I found your numbers
curiously high. I work quite a bit with git and large files and I
haven't noticed this slowdown. Can you be more specific about your load?
Are you sure it is zlib?
On my 1.8GHz Athlon, compressing 45MB of zeros into 20K takes about 2s.
Compressing 45MB of random data into a 45MB object takes 6.3s. In either
case, the commit takes only about 0.5s (since cogito stores the object
during the cg-add).
Is there some specific file pattern which is slow to compress?
-Peff
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 21:45 ` Compression speed for large files Jeff King
@ 2006-07-03 22:25 ` Joachim Berdal Haga
2006-07-03 23:02 ` Linus Torvalds
0 siblings, 1 reply; 21+ messages in thread
From: Joachim Berdal Haga @ 2006-07-03 22:25 UTC (permalink / raw)
To: Jeff King; +Cc: Joachim B Haga, git
Jeff King wrote:
> On Mon, Jul 03, 2006 at 11:13:34AM +0000, Joachim B Haga wrote:
>
>> often binary. In git, committing of large files is very slow; I have
>> tested with a 45MB file, which takes about 1 minute to check in (on an
>> intel core-duo 2GHz).
>
> I know this has already been somewhat solved, but I found your numbers
> curiously high. I work quite a bit with git and large files and I
> haven't noticed this slowdown. Can you be more specific about your load?
> Are you sure it is zlib?
Quite sure: at least to the extent that it is fixed by lowering the
compression level. But the wording was inexact: it's during object
creation, which happens at initial "git add" and then later during "git
commit".
But...
> On my 1.8GHz Athlon, compressing 45MB of zeros into 20K takes about 2s.
> Compressing 45MB of random data into a 45MB object takes 6.3s. In either
> case, the commit takes only about 0.5s (since cogito stores the object
> during the cg-add).
>
> Is there some specific file pattern which is slow to compress?
Yes, it seems so. At least the effect is much more pronounced for my
files than for random/null data. "My" files are, in this context, generated
data files, binary or ascii.
Here's a test with "time gzip -[169] -c file >/dev/null". Random data
from /dev/urandom, kernel headers are concatenation of *.h in kernel
sources. All times in seconds, on my puny home computer (1GHz Via Nehemiah):

        random (23MB)   data (23MB)   headers (44MB)
-9          10.2            72.5           38.5
-6          10.2            13.5           12.9
-1           9.9             4.1            7.0
So... data dependent, yes. But it hits even for normal source code.
(Btw, the default (-6) seems to be less data-dependent than the other
values. Maybe that's on purpose.)
If you want to look at a highly-variable dataset (the one above), try
http://lupus.ig3.net/SIMULATION.dx.gz (5MB, slow server); but that's just an
example, and I see the same variability on binary data files as well.
-j.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 22:25 ` Joachim Berdal Haga
@ 2006-07-03 23:02 ` Linus Torvalds
2006-07-04 5:42 ` Joachim Berdal Haga
0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2006-07-03 23:02 UTC (permalink / raw)
To: Joachim Berdal Haga; +Cc: Jeff King, Joachim B Haga, git
On Tue, 4 Jul 2006, Joachim Berdal Haga wrote:
>
> Here's a test with "time gzip -[169] -c file >/dev/null". Random data
> from /dev/urandom, kernel headers are concatenation of *.h in kernel
> sources. All times in seconds, on my puny home computer (1GHz Via Nehemiah)
That "Via Nehemiah" is probably a big part of it.
I think the VIA Nehemiah just has a 64kB L2 cache, and I bet performance
plummets if the tables end up being used past that.
And I think a large part of the higher compression levels is that they
allow the compression window and tables to grow bigger.
Linus
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Compression speed for large files
2006-07-03 23:02 ` Linus Torvalds
@ 2006-07-04 5:42 ` Joachim Berdal Haga
0 siblings, 0 replies; 21+ messages in thread
From: Joachim Berdal Haga @ 2006-07-04 5:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, git
Linus Torvalds wrote:
>
> On Tue, 4 Jul 2006, Joachim Berdal Haga wrote:
>> Here's a test with "time gzip -[169] -c file >/dev/null". Random data
>> from /dev/urandom, kernel headers are concatenation of *.h in kernel
>> sources. All times in seconds, on my puny home computer (1GHz Via Nehemiah)
>
> That "Via Nehemiah" is probably a big part of it.
>
> I think the VIA Nehemiah just has a 64kB L2 cache, and I bet performance
> plummets if the tables end up being used past that.
Not really. The numbers in my original post were from an Intel core-duo;
they were 158/18/6 s for comparable (but larger) data.
And on a P4 1.8GHz with 512kB L2, the same 23MB data file compresses in
28.1/5.9/1.3 s. That's a factor of 22 slowest/fastest; the VIA was only a
factor of 18, so the difference is actually *larger* on the bigger-cache machine.
-j.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] Use configurable zlib compression level everywhere.
2006-07-03 19:43 ` Junio C Hamano
@ 2006-07-07 21:53 ` David Lang
2006-07-08 2:10 ` Johannes Schindelin
0 siblings, 1 reply; 21+ messages in thread
From: David Lang @ 2006-07-07 21:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Joachim B Haga, git
On Mon, 3 Jul 2006, Junio C Hamano wrote:
> * sha1write_compressed() in csum-file.c is for producing packs
> and most of the things we compress there are deltas and less
> compressible, so even when core.compression is set to high we
> might be better off using faster compression.
Why would deltas have poor compression? I'd expect them to have about the same
compression as the files they are deltas of (or slightly better, due to the fact
that the delta metainfo is highly repetitive).
David Lang
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] Use configurable zlib compression level everywhere.
2006-07-07 21:53 ` David Lang
@ 2006-07-08 2:10 ` Johannes Schindelin
0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2006-07-08 2:10 UTC (permalink / raw)
To: David Lang; +Cc: Junio C Hamano, Joachim B Haga, git
Hi,
On Fri, 7 Jul 2006, David Lang wrote:
> On Mon, 3 Jul 2006, Junio C Hamano wrote:
>
> > * sha1write_compressed() in csum-file.c is for producing packs
> > and most of the things we compress there are deltas and less
> > compressible, so even when core.compression is set to high we
> > might be better off using faster compression.
>
> Why would deltas have poor compression? I'd expect them to have about the same
> compression as the files they are deltas of (or slightly better, due to the fact
> that the delta metainfo is highly repetitive).
Deltas should compress poorly by definition, because compression works by
encoding more efficiently those parts of the file that do not carry much
information (think entropy).
If you have deltas which really make sense, they are almost _pure_
information, i.e. they do not contain much redundancy compared to real
files. So the compression (which does not know anything about the
characteristics of deltas in particular) cannot take much redundancy out
of the delta. Therefore, the entropy is very high, and the compression
ratio is low.
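You can see the same effect without git at all: deflate a redundant buffer
once, then deflate the result again. The second pass gains next to nothing,
because the first pass already squeezed the redundancy out, and a dense
delta starts life in exactly that state. A sketch, using nothing but zlib's
compress2():

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        static unsigned char text[1 << 20];   /* redundant input */
        static unsigned char pass1[1 << 21], pass2[1 << 21];
        unsigned long n1 = sizeof(pass1), n2 = sizeof(pass2), i;

        /* Fill 1MB with a repeated 29-byte phrase: low entropy. */
        for (i = 0; i < sizeof(text); i++)
            text[i] = "int main(void) { return 0; }\n"[i % 29];

        compress2(pass1, &n1, text, sizeof(text), 9); /* shrinks a lot */
        compress2(pass2, &n2, pass1, n1, 9);          /* barely shrinks */
        printf("original %lu -> once %lu -> twice %lu\n",
               (unsigned long)sizeof(text), n1, n2);
        return 0;
    }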
Hope this makes sense to you,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2006-07-08 2:10 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-03 11:13 Compression speed for large files Joachim B Haga
2006-07-03 12:03 ` Alex Riesen
2006-07-03 12:42 ` Elrond
2006-07-03 13:44 ` Joachim B Haga
2006-07-03 13:32 ` Joachim Berdal Haga
[not found] ` <Pine.LNX.4.64.0607031030150.1213@localhost.localdomain>
2006-07-03 14:33 ` Nicolas Pitre
2006-07-03 14:54 ` Yakov Lerner
2006-07-03 15:17 ` Johannes Schindelin
2006-07-03 16:31 ` Linus Torvalds
2006-07-03 18:59 ` [PATCH] Make zlib compression level configurable, and change default Joachim B Haga
2006-07-03 19:33 ` Linus Torvalds
2006-07-03 19:50 ` Linus Torvalds
2006-07-03 20:11 ` Joachim B Haga
2006-07-03 19:02 ` [PATCH] Use configurable zlib compression level everywhere Joachim B Haga
2006-07-03 19:43 ` Junio C Hamano
2006-07-07 21:53 ` David Lang
2006-07-08 2:10 ` Johannes Schindelin
2006-07-03 21:45 ` Compression speed for large files Jeff King
2006-07-03 22:25 ` Joachim Berdal Haga
2006-07-03 23:02 ` Linus Torvalds
2006-07-04 5:42 ` Joachim Berdal Haga