two questions about the format of loose object

From: Liu Yubao <yubao.liu@gmail.com>
To: git list <git@vger.kernel.org>
Subject: two questions about the format of loose object
Date: Mon, 01 Dec 2008 16:00:55 +0800
Message-ID: <493399B7.5000505@gmail.com>

Hi,

In the current implementation, loose objects are compressed as a whole:

     loose object = deflate(typename + <space> + size + '\0' + data)
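
To make the layout concrete, here is a minimal Python sketch of that
formula. It is not git's C code; the function name encode_loose_object
is made up for illustration, and it assumes the object name is the
SHA-1 of the uncompressed header plus data.

```python
import hashlib
import zlib


def encode_loose_object(obj_type, data):
    """Return (object id, compressed file contents) for a loose object.

    Hypothetical helper for illustration; not part of git.
    """
    # Header is "<typename> <size>\0", with the size in decimal ASCII.
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    raw = header + data
    # The object name is the SHA-1 of the uncompressed header + data.
    object_id = hashlib.sha1(raw).hexdigest()
    # Header and data are deflated together as one zlib stream.
    return object_id, zlib.compress(raw)
```

Calling encode_loose_object("blob", b"hello\n") returns the object name
and the bytes that would be stored on disk. Note that the size is only
visible after inflating at least the start of the stream.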

In sha1_file.c:unpack_sha1_file():
        1) unpack_sha1_header() inflates the first 8KB
        2) parse_sha1_header() gets the object's size
        3) unpack_sha1_rest() allocates a (1+size)-byte buffer and
           copies the first 8KB, minus the header, into it.
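
The three steps above can be sketched roughly as follows in Python.
This only mirrors the control flow as I described it; the name
read_loose_object is hypothetical, and Python has no need for the
extra trailing byte that the C code allocates.

```python
import zlib

HEADER_WINDOW = 8192  # the 8KB first-pass size mentioned above


def read_loose_object(compressed):
    """Inflate a loose object in two passes; returns (type, data)."""
    inflater = zlib.decompressobj()
    # Step 1: inflate at most the first 8KB of output.
    head = inflater.decompress(compressed, HEADER_WINDOW)
    # Step 2: parse "<typename> <size>\0" from the inflated prefix.
    nul = head.index(b"\0")
    type_name, size_text = head[:nul].decode("ascii").split(" ")
    size = int(size_text)
    # Inflate whatever input the first pass left unconsumed.
    first = head[nul + 1:]
    rest = inflater.decompress(inflater.unconsumed_tail) + inflater.flush()
    if len(first) + len(rest) != size:
        raise ValueError("size in header does not match inflated data")
    # Step 3: allocate the final buffer and copy the already-inflated
    # bytes (minus the header) into it, followed by the remainder.
    body = bytearray(size)
    body[:len(first)] = first
    body[len(first):] = rest
    return type_name, bytes(body)
```

The copy of `first` into `body` is the memcpy that Question 1 below is
about.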

* Question 1:

Why not use the format below for loose objects?
    loose object = typename + <space> + size + '\0' + deflate(data)

Then the size of a loose object would be known before inflating it, and
the 8KB memcpy in step 3 above would not be required.
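
A sketch of what reading and writing this proposed format could look
like, in Python. The names encode_q1 and decode_q1 are hypothetical,
and this format is only my proposal, not something git reads today.

```python
import zlib


def encode_q1(obj_type, data):
    """Proposed layout: plain-text header followed by deflate(data)."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return header + zlib.compress(data)


def decode_q1(stored):
    """Parse the plain header first, then inflate straight to the data."""
    nul = stored.index(b"\0")
    type_name, size_text = stored[:nul].decode("ascii").split(" ")
    size = int(size_text)
    # The size is known before inflating, so the output buffer can be
    # sized up front (bufsize is only a hint to zlib.decompress).
    data = zlib.decompress(stored[nul + 1:], bufsize=max(size, 1))
    if len(data) != size:
        raise ValueError("size in header does not match inflated data")
    return type_name, data
```

Here the header never passes through inflate, so there is no partially
inflated first chunk to copy around.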

In general, deflate() reduces the size of a text file by about 70%.
I checked the git and linux-2.6 sources and got the statistics below
(percentage of files no larger than each limit):

.------------------+--------------+--------.
| file size        | <= (8/0.3)KB | <= 8KB |
|------------------+--------------+--------|
| git-1.6.03       |          97% |    84% |
| linux-2.6.27-rc6 |          90% |    66% |
`------------------+--------------+--------'
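
The (8/0.3)KB limit comes from the 70% estimate: a file that shrinks to
30% of its size fits in 8KB after deflate if it is at most 8/0.3, or
about 26.7KB, before. Below is a rough Python sketch of how such
percentages could be gathered. It is not the script I used; the name
size_shares is hypothetical, and it simply counts every regular file
under a directory.

```python
import os
import stat

WINDOW_BYTES = 8 * 1024   # the 8KB first-pass window
ASSUMED_RATIO = 0.3       # compressed/original, from the "70%" estimate


def size_shares(root):
    """Percentage of regular files under root within each size limit."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(info.st_mode):
                sizes.append(info.st_size)
    if not sizes:
        raise ValueError(f"no regular files under {root!r}")
    limits = {
        "<= (8/0.3)KB": WINDOW_BYTES / ASSUMED_RATIO,
        "<= 8KB": WINDOW_BYTES,
    }
    return {
        label: 100.0 * sum(size <= limit for size in sizes) / len(sizes)
        for label, limit in limits.items()
    }
```

Running size_shares() on an unpacked source tree gives one row of the
table above, up to whatever files the tree contains besides sources.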


* Question 2:

Why not use uncompressed loose objects? That is to say:
   loose object = typename + <space> + size + '\0' + data

I ran a simple benchmark on my notebook and on a server at my company:
writing a big file to disk as-is is faster than compressing it first and
writing the result out. Reading uncompressed files should also be faster
because of the file cache.

The current implementation caches objects per process. The objects
can't be shared among processes because they are inflated into each
process's heap memory.

Uncompressed loose objects are better for sharing objects among
multiple git processes because they can be used directly after being
mmap-ed.
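
As an illustration of the mmap idea, here is a minimal Python sketch.
The helpers write_uncompressed and map_uncompressed are hypothetical,
and the on-disk layout is the uncompressed format proposed in this
question, which git does not use today.

```python
import mmap


def write_uncompressed(path, obj_type, data):
    """Store header + data as-is, with no deflate step."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    with open(path, "wb") as f:
        f.write(header + data)


def map_uncompressed(path):
    """Map the file read-only and return (type, payload view, mapping).

    The payload is a memoryview into the mapping, so no copy of the
    object data is made in this process's heap.
    """
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    nul = mapped.find(b"\0")
    if nul < 0:
        mapped.close()
        raise ValueError("no header terminator found")
    type_name, size_text = mapped[:nul].decode("ascii").split(" ")
    size = int(size_text)
    payload = memoryview(mapped)[nul + 1:nul + 1 + size]
    return type_name, payload, mapped
```

The caller has to release the payload view before closing the mapping.
Pages mapped this way come from the file cache, which is what lets
several processes share them.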

And I guess the most frequently used objects are loose objects
when you are coding (git add, git diff, git diff --cached, git merge),
so using uncompressed loose objects avoids inflating the same objects
again and again.


Below is the result of my simple benchmark:

########################################
# on my notebook
$ perl b.pl git-1.5.6/Makefile 1000
               Rate   compressed uncompressed
compressed    198/s           --         -92%
uncompressed 2463/s        1147%           --


$ perl b.pl git-1.5.6/parse-options.c 2000
               Rate   compressed uncompressed
compressed    341/s           --         -88%
uncompressed 2845/s         734%           --


$ find git-1.5.6/ -name "*.[ch]" -exec cat {} + > all.c
$ perl b.pl all.c 1000
               Rate   compressed uncompressed
compressed   3.39/s           --         -97%
uncompressed  111/s        3182%           --

#######################################
# on a server
$ perl b.pl Makefile 6000
            (warning: too few iterations for a reliable count)
                Rate   compressed uncompressed
compressed     447/s           --         -98%
uncompressed 18750/s        4094%           --

$ perl b.pl parse-options.c 8000
            (warning: too few iterations for a reliable count)
                Rate   compressed uncompressed
compressed    1130/s           --         -97%
uncompressed 33333/s        2850%           --

$ perl b.pl all.c 1000
               Rate   compressed uncompressed
compressed   5.48/s           --         -95%
uncompressed  115/s        1997%           --

#####################################################
# b.pl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:hireswallclock cmpthese);
use File::Slurp;
use IO::Compress::Deflate qw(deflate $DeflateError);

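# Read the input file once, outside the timed code.
my $text = read_file($ARGV[0], binmode => ':raw');

# Run each sub $ARGV[1] times and compare the rates.
cmpthese($ARGV[1], {'compressed' => \&zip, 'uncompressed' => \&output});

# Deflate the text and write the compressed result to disk.
sub zip {
    # Use "or" (not "||") so that die applies to deflate()'s return value.
    deflate(\$text => 'all.c.z') or die "deflate failed: $DeflateError\n";
}

# Write the text to disk as-is.
sub output {
    write_file("all2.c", {binmode => ':raw'}, $text);
}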
