git.vger.kernel.org archive mirror
* git repository size / compression
@ 2011-09-09  2:37 neubyr
  2011-09-09  8:23 ` Carlos Martín Nieto
  0 siblings, 1 reply; 10+ messages in thread
From: neubyr @ 2011-09-09  2:37 UTC (permalink / raw)
  To: git

I have a test git repository with just two files in it. One of the
files has a set of two lines that is repeated n times.
e.g.:
{{{
$ for i in {1..5}; do cat ./lexico.txt >> lexico1.txt &&  cat
./lexico.txt >> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
}}}

I ran the above command a few times and committed after each run. The
disk usage of the repository directory is shown below. The 419M
is the working directory size and 2.7M is the git repository/database size.

{{{
$ du -h -d 1 .
2.7M    ./.git
419M    .

}}}

Is this because of the compression performed by git before storing data
(or before sending a commit)?

The following were the results with Subversion:

Subversion client (redundant(?) copy exists in .svn/text-base/
directory, hence double size in client):
{{{
$ du -h -d 1
416M    ./.svn
832M    .
}}}

Subversion repo/server:
{{{
$ du -h -d 1
 12K    ./conf
1.2M    ./db
 36K    ./hooks
8.0K    ./locks
1.2M    .
}}}

--
neuby.r


* Re: git repository size / compression
  2011-09-09  2:37 git repository size / compression neubyr
@ 2011-09-09  8:23 ` Carlos Martín Nieto
  2011-09-09 14:04   ` neubyr
  2011-09-09 16:05   ` John Szakmeister
  0 siblings, 2 replies; 10+ messages in thread
From: Carlos Martín Nieto @ 2011-09-09  8:23 UTC (permalink / raw)
  To: neubyr; +Cc: git


On Thu, 2011-09-08 at 21:37 -0500, neubyr wrote:
> I have a test git repository with just two files in it. One of the
> file in it has a set of two lines that is repeated n times.
> e.g.:
> {{{
> $ for i in {1..5}; do cat ./lexico.txt >> lexico1.txt &&  cat
> ./lexico.txt >> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
> }}}
> 

So you've just created some data that can be compressed quite
efficiently.

> I ran above command few times and performed commit after each run. Now
> disk usage of this repository directory is mentioned below. The 419M
> is working directory size and 2.7M is git repository/database size.
> 
> {{{
> $ du -h -d 1 .
> 2.7M    ./.git
> 419M    .
> 
> }}}
> 
> Is it because of the compression performed by git before storing data
> (or before sending commit)??
> 

Yes. Git stores its objects (the commit, the snapshot of the files,
etc.) compressed. When these objects are stored in a pack, the size can
be further reduced by storing some objects as deltas, each of which
describes the difference between that object and some other object in
the object database.
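
A minimal sketch of the loose-object side of this, assuming Python's
standard zlib and hashlib modules and a default SHA-1 repository. The
two-line payload and the name loose_blob are made up for illustration.
Git names a blob by the hash of a short header plus the content and
stores the zlib-compressed result, so highly repetitive content shrinks
dramatically:

```python
import hashlib
import zlib
from typing import Tuple


def loose_blob(content: bytes) -> Tuple[str, bytes]:
    """Return (object id, compressed bytes) for a blob with this content."""
    # A loose object is "<type> <size>\0" followed by the raw content.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    return object_id, zlib.compress(store)


# Hypothetical stand-in for lexico.txt: two lines, doubled 15 times.
data = b"first line\nsecond line\n"
for _ in range(15):
    data = data + data

object_id, compressed = loose_blob(data)
print(object_id)
print(len(data), "bytes raw ->", len(compressed), "bytes compressed")
```

The exact compressed size depends on the zlib build, but for input like
this it should be a tiny fraction of the raw size.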

> Following were results with subversion:
> 
> Subversion client (redundant(?) copy exists in .svn/text-base/
> directory, hence double size in client):
> {{{
> $ du -h -d 1
> 416M    ./.svn
> 832M    .
> }}}

Subversion stores the "pristines" (the state of the files in the
latest revision) inside the .svn directory. I wouldn't call this copy
redundant, though, as it allows you to run diff locally. The pristines
are stored uncompressed, which is why half of the space is taken up by
the .svn directory.

> 
> Subversion repo/server:
> {{{
> $ du -h -d 1
>  12K    ./conf
> 1.2M    ./db
>  36K    ./hooks
> 8.0K    ./locks
> 1.2M    .
> }}}

I don't know how the repository is stored in Subversion, but it may also
be compressed. You may be able to reduce your git repository size by
(re)generating packs with 'git repack' and doing some cleanup with 'git
gc', but repository size is rarely a concern.

   cmn





* Re: git repository size / compression
  2011-09-09  8:23 ` Carlos Martín Nieto
@ 2011-09-09 14:04   ` neubyr
  2011-09-09 14:25     ` Sverre Rabbelier
                       ` (2 more replies)
  2011-09-09 16:05   ` John Szakmeister
  1 sibling, 3 replies; 10+ messages in thread
From: neubyr @ 2011-09-09 14:04 UTC (permalink / raw)
  To: Carlos Martín Nieto; +Cc: git

On Fri, Sep 9, 2011 at 3:23 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
> On Thu, 2011-09-08 at 21:37 -0500, neubyr wrote:
>> I have a test git repository with just two files in it. One of the
>> file in it has a set of two lines that is repeated n times.
>> e.g.:
>> {{{
>> $ for i in {1..5}; do cat ./lexico.txt >> lexico1.txt &&  cat
>> ./lexico.txt >> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
>> }}}
>>
>
> So you've just created some data that can be compressed quite
> efficiently.
>
>> I ran above command few times and performed commit after each run. Now
>> disk usage of this repository directory is mentioned below. The 419M
>> is working directory size and 2.7M is git repository/database size.
>>
>> {{{
>> $ du -h -d 1 .
>> 2.7M    ./.git
>> 419M    .
>>
>> }}}
>>
>> Is it because of the compression performed by git before storing data
>> (or before sending commit)??
>>
>
> Yes. Git stores its objects (the commit, the snapshot of the files,
> etc.) compressed. When these objects are stored in a pack, the size can
> be further reduced by storing some objects as deltas which describe the
> difference between itself and some other object in the object-db.
>

Does git store deltas for some files? I thought it uses snapshots
(exact copy of staged files) only.


>> Following were results with subversion:
>>
>> Subversion client (redundant(?) copy exists in .svn/text-base/
>> directory, hence double size in client):
>> {{{
>> $ du -h -d 1
>> 416M    ./.svn
>> 832M    .
>> }}}
>
> Subversion stores the "pristines" (which is the status of the files in
> the latest revision) inside the .svn directory. I wouldn't call this
> copy redundant, though, as it allows you to run diff locally. The
> pristines are stored uncompressed, which is why you half of the space is
> taken up by the .svn directory.
>
>>
>> Subversion repo/server:
>> {{{
>> $ du -h -d 1
>>  12K    ./conf
>> 1.2M    ./db
>>  36K    ./hooks
>> 8.0K    ./locks
>> 1.2M    .
>> }}}
>
> I don't know how the repository is stored in Subversion, but it may also
> be compressed. You may be able to reduced your git repository size by
> (re)generating packs with 'git repack' and doing some cleanups with 'git
> gc', but the repository size is not often a concern.
>
>   cmn
>
>
>

that's helpful. thanks.

--
neuby.r


* Re: git repository size / compression
  2011-09-09 14:04   ` neubyr
@ 2011-09-09 14:25     ` Sverre Rabbelier
  2011-09-09 14:28     ` Carlos Martín Nieto
  2011-09-09 14:54     ` Jakub Narebski
  2 siblings, 0 replies; 10+ messages in thread
From: Sverre Rabbelier @ 2011-09-09 14:25 UTC (permalink / raw)
  To: neubyr; +Cc: Carlos Martín Nieto, git

Heya,

On Fri, Sep 9, 2011 at 16:04, neubyr <neubyr@gmail.com> wrote:
> Does git store deltas for some files? I thought it uses snapshots
> (exact copy of staged files) only.

In packs, yes, it will try to delta objects as efficiently as possible.

-- 
Cheers,

Sverre Rabbelier


* Re: git repository size / compression
  2011-09-09 14:04   ` neubyr
  2011-09-09 14:25     ` Sverre Rabbelier
@ 2011-09-09 14:28     ` Carlos Martín Nieto
  2011-09-09 15:07       ` neubyr
  2011-09-09 14:54     ` Jakub Narebski
  2 siblings, 1 reply; 10+ messages in thread
From: Carlos Martín Nieto @ 2011-09-09 14:28 UTC (permalink / raw)
  To: neubyr; +Cc: git


On Fri, 2011-09-09 at 09:04 -0500, neubyr wrote:
> On Fri, Sep 9, 2011 at 3:23 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
> > On Thu, 2011-09-08 at 21:37 -0500, neubyr wrote:
> >> I have a test git repository with just two files in it. One of the
> >> file in it has a set of two lines that is repeated n times.
> >> e.g.:
> >> {{{
> >> $ for i in {1..5}; do cat ./lexico.txt >> lexico1.txt &&  cat
> >> ./lexico.txt >> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
> >> }}}
> >>
> >
> > So you've just created some data that can be compressed quite
> > efficiently.
> >
> >> I ran above command few times and performed commit after each run. Now
> >> disk usage of this repository directory is mentioned below. The 419M
> >> is working directory size and 2.7M is git repository/database size.
> >>
> >> {{{
> >> $ du -h -d 1 .
> >> 2.7M    ./.git
> >> 419M    .
> >>
> >> }}}
> >>
> >> Is it because of the compression performed by git before storing data
> >> (or before sending commit)??
> >>
> >
> > Yes. Git stores its objects (the commit, the snapshot of the files,
> > etc.) compressed. When these objects are stored in a pack, the size can
> > be further reduced by storing some objects as deltas which describe the
> > difference between itself and some other object in the object-db.
> >
> 
> Does git store deltas for some files? I thought it uses snapshots
> (exact copy of staged files) only.

Yes and no. The data model for git is to always store snapshots, and it
always expects to have the full files available. In a packfile, however,
in order to save space, some objects are stored as deltas against other
objects in the same packfile.

http://progit.org/book/ch9-4.html
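
To make the snapshot-versus-delta distinction concrete, here is a toy
sketch. The tuple-based copy/insert encoding and the names apply_delta,
base and target are assumptions for illustration; git's real pack delta
format is a compact binary encoding, though it is likewise built from
"copy a range from the base object" and "insert literal data"
instructions.

```python
def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild a full object from a base object and a list of delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            # ("copy", offset, length): reuse bytes already in the base.
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            # ("insert", data): literal bytes that are not in the base.
            out += op[1]
        else:
            raise ValueError("unknown delta op: %r" % (op[0],))
    return bytes(out)


# Hypothetical previous version of the file, and a new version that is
# the old content doubled plus one new line.
base = b"first line\nsecond line\n" * 1000
target = base + base + b"one new line\n"

# The delta is tiny even though the target is larger than the base.
delta = [
    ("copy", 0, len(base)),
    ("copy", 0, len(base)),
    ("insert", b"one new line\n"),
]

rebuilt = apply_delta(base, delta)
print(rebuilt == target, len(target), "bytes rebuilt from", len(delta), "ops")
```

The full snapshot can always be reconstructed, which is why the data
model still looks like "snapshots only" from the outside.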






* Re: git repository size / compression
  2011-09-09 14:04   ` neubyr
  2011-09-09 14:25     ` Sverre Rabbelier
  2011-09-09 14:28     ` Carlos Martín Nieto
@ 2011-09-09 14:54     ` Jakub Narebski
  2011-09-09 15:09       ` neubyr
  2 siblings, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2011-09-09 14:54 UTC (permalink / raw)
  To: neubyr; +Cc: Carlos Martín Nieto, git

neubyr <neubyr@gmail.com> writes:
> On Fri, Sep 9, 2011 at 3:23 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
> > On Thu, 2011-09-08 at 21:37 -0500, neubyr wrote:

>>> I have a test git repository with just two files in it. One of the
>>> file in it has a set of two lines that is repeated n times.
>>> e.g.:
>>> {{{
>>> $ for i in {1..5}; do cat ./lexico.txt>> lexico1.txt &&  cat
>>> ./lexico.txt>> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
>>> }}}
>>>
>>
>> So you've just created some data that can be compressed quite
>> efficiently.
>>
>>> I ran above command few times and performed commit after each run. Now
>>> disk usage of this repository directory is mentioned below. The 419M
>>> is working directory size and 2.7M is git repository/database size.
>>>
>>> {{{
>>> $ du -h -d 1 .
>>> 2.7M    ./.git
>>> 419M    .
>>>
>>> }}}

Have you tried the same but with

   $ git gc --prune=now

before running `du`?

>>> Is it because of the compression performed by git before storing data
>>> (or before sending commit)??
>>
>> Yes. Git stores its objects (the commit, the snapshot of the files,
>> etc.) compressed. When these objects are stored in a pack, the size can
>> be further reduced by storing some objects as deltas which describe the
>> difference between itself and some other object in the object-db.
> 
> Does git store deltas for some files? I thought it uses snapshots
> (exact copy of staged files) only.

When creating a packfile from loose objects (e.g. via `git gc`), it
does perform delta compression.

-- 
Jakub Narębski


* Re: git repository size / compression
  2011-09-09 14:28     ` Carlos Martín Nieto
@ 2011-09-09 15:07       ` neubyr
  0 siblings, 0 replies; 10+ messages in thread
From: neubyr @ 2011-09-09 15:07 UTC (permalink / raw)
  To: Carlos Martín Nieto; +Cc: git

On Fri, Sep 9, 2011 at 9:28 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
> On Fri, 2011-09-09 at 09:04 -0500, neubyr wrote:
>> On Fri, Sep 9, 2011 at 3:23 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
>> > On Thu, 2011-09-08 at 21:37 -0500, neubyr wrote:
>> >> I have a test git repository with just two files in it. One of the
>> >> file in it has a set of two lines that is repeated n times.
>> >> e.g.:
>> >> {{{
>> >> $ for i in {1..5}; do cat ./lexico.txt >> lexico1.txt &&  cat
>> >> ./lexico.txt >> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
>> >> }}}
>> >>
>> >
>> > So you've just created some data that can be compressed quite
>> > efficiently.
>> >
>> >> I ran above command few times and performed commit after each run. Now
>> >> disk usage of this repository directory is mentioned below. The 419M
>> >> is working directory size and 2.7M is git repository/database size.
>> >>
>> >> {{{
>> >> $ du -h -d 1 .
>> >> 2.7M    ./.git
>> >> 419M    .
>> >>
>> >> }}}
>> >>
>> >> Is it because of the compression performed by git before storing data
>> >> (or before sending commit)??
>> >>
>> >
>> > Yes. Git stores its objects (the commit, the snapshot of the files,
>> > etc.) compressed. When these objects are stored in a pack, the size can
>> > be further reduced by storing some objects as deltas which describe the
>> > difference between itself and some other object in the object-db.
>> >
>>
>> Does git store deltas for some files? I thought it uses snapshots
>> (exact copy of staged files) only.
>
> Yes and no. The data model for git is to always store snapshots, and it
> always expects to have the full files available. In a packfile, however,
> in order to save space, some objects are stored as deltas to other
> objects in the same file.
>
> http://progit.org/book/ch9-4.html
>

Excellent.. That explains compression and deltas really well. Thanks again..

--
neuby.r


* Re: git repository size / compression
  2011-09-09 14:54     ` Jakub Narebski
@ 2011-09-09 15:09       ` neubyr
  0 siblings, 0 replies; 10+ messages in thread
From: neubyr @ 2011-09-09 15:09 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Carlos Martín Nieto, git, pjweisberg

2011/9/9 Jakub Narebski <jnareb@gmail.com>:
> neubyr <neubyr@gmail.com> writes:
>> On Fri, Sep 9, 2011 at 3:23 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
>> > On Thu, 2011-09-08 at 21:37 -0500, neubyr wrote:
>
>>>> I have a test git repository with just two files in it. One of the
>>>> file in it has a set of two lines that is repeated n times.
>>>> e.g.:
>>>> {{{
>>>> $ for i in {1..5}; do cat ./lexico.txt>> lexico1.txt &&  cat
>>>> ./lexico.txt>> lexico1.txt && mv ./lexico1.txt ./lexico.txt;  done
>>>> }}}
>>>>
>>>
>>> So you've just created some data that can be compressed quite
>>> efficiently.
>>>
>>>> I ran above command few times and performed commit after each run. Now
>>>> disk usage of this repository directory is mentioned below. The 419M
>>>> is working directory size and 2.7M is git repository/database size.
>>>>
>>>> {{{
>>>> $ du -h -d 1 .
>>>> 2.7M    ./.git
>>>> 419M    .
>>>>
>>>> }}}
>
> Have you tried the same but with
>
>   $ git gc --prune=now
>
> before running `du`?
>

Nope, I hadn't run git gc before. Here are the du results after running
the git gc command. The .git directory is about two-thirds smaller now
(2.7M down to 924K). Great!

{{{
$ du -d 1 -h
924K    ./.git
417M    .
}}}
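
A quick check of the saving from the two .git figures, as a sketch. It
assumes `du -h` reports binary units (K = 1024 bytes, M = 1024 K) and
that the 2.7M figure is itself rounded, so the result is approximate.
The helper name du_bytes is made up.

```python
# Multipliers for the unit suffixes printed by `du -h` (assumed binary).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def du_bytes(size: str) -> float:
    """Convert a human-readable du size such as '2.7M' to bytes."""
    size = size.strip()
    if size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size)


before = du_bytes("2.7M")  # .git before `git gc`
after = du_bytes("924K")   # .git after `git gc`
reduction = 1 - after / before
print("%.1f%% smaller" % (reduction * 100))
```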


>>>> Is it because of the compression performed by git before storing data
>>>> (or before sending commit)??
>>>
>>> Yes. Git stores its objects (the commit, the snapshot of the files,
>>> etc.) compressed. When these objects are stored in a pack, the size can
>>> be further reduced by storing some objects as deltas which describe the
>>> difference between itself and some other object in the object-db.
>>
>> Does git store deltas for some files? I thought it uses snapshots
>> (exact copy of staged files) only.
>
> When creating packfile from loose objects (e.g. via `git gc`), it
> does perform delta compression.
>
> --
> Jakub Narębski
>

thank you everyone for explaining in detail..

--
neuby.r


* Re: git repository size / compression
  2011-09-09  8:23 ` Carlos Martín Nieto
  2011-09-09 14:04   ` neubyr
@ 2011-09-09 16:05   ` John Szakmeister
  2011-09-09 17:49     ` Andreas Krey
  1 sibling, 1 reply; 10+ messages in thread
From: John Szakmeister @ 2011-09-09 16:05 UTC (permalink / raw)
  To: Carlos Martín Nieto; +Cc: neubyr, git

On Fri, Sep 9, 2011 at 4:23 AM, Carlos Martín Nieto <cmn@elego.de> wrote:
[snip]
>> Subversion repo/server:
>> {{{
>> $ du -h -d 1
>>  12K    ./conf
>> 1.2M    ./db
>>  36K    ./hooks
>> 8.0K    ./locks
>> 1.2M    .
>> }}}
>
> I don't know how the repository is stored in Subversion, but it may also
> be compressed. You may be able to reduce your git repository size by
> (re)generating packs with 'git repack' and doing some cleanup with 'git
> gc', but repository size is rarely a concern.

It is stored compressed in Subversion, which also generates deltas
against previous versions.  IIRC, the delta algorithm is an xdelta-based
one, and the delta is then run through compression.  Subversion
will at times choose to compress the full file on its own, instead of
doing a delta and compressing that.  IIRC, there are some heuristics in
there for determining when to do that, but I forget the exact method.

HTH!

-John


* Re: git repository size / compression
  2011-09-09 16:05   ` John Szakmeister
@ 2011-09-09 17:49     ` Andreas Krey
  0 siblings, 0 replies; 10+ messages in thread
From: Andreas Krey @ 2011-09-09 17:49 UTC (permalink / raw)
  To: John Szakmeister; +Cc: Carlos Martín Nieto, neubyr, git

On Fri, 09 Sep 2011 12:05:03 +0000, John Szakmeister wrote:
...
> will at times choose to self-compress the file, instead of doing a
> delta and compressing.  IIRC, there are some heuristics in there for
> determining when to do that, but I forget the exact method.

I don't know about the compression part, but Subversion stores the nth
version of a file (not the global revision number n) as a delta against
version m, where m is (n & (n-1)), i.e. n with its least significant '1'
bit flipped to '0'. That way, only O(log(n)) deltas instead of O(n) have
to be applied to get at a specific version.

[Was on the svn users list just then. They described it differently,
 but in essence it's that.]
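
The rule above can be sketched in a few lines. This is only an
illustration of the n & (n-1) scheme as described here, not Subversion's
actual code; the function names, and the assumption that version 0 is
the one available in full, are hypothetical.

```python
def delta_base(n: int) -> int:
    """Version that version n is stored as a delta against."""
    return n & (n - 1)  # clears the least significant '1' bit of n


def delta_chain(n: int) -> list:
    """Versions visited when reconstructing version n, ending at version 0."""
    chain = [n]
    while n > 0:
        n = delta_base(n)
        chain.append(n)
    return chain


for version in (54, 255, 256):
    chain = delta_chain(version)
    print(version, "->", chain, "deltas to apply:", len(chain) - 1)
```

Each step removes one '1' bit, so the number of deltas to apply equals
the number of '1' bits in n, which is at most the bit length of n. That
is where the O(log(n)) bound comes from.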

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800
