git.vger.kernel.org archive mirror
* pack-object poor performance (with large number of objects?)
@ 2011-10-03 14:43 Piotr Krukowiecki
  2011-10-03 16:05 ` Shawn Pearce
  0 siblings, 1 reply; 11+ messages in thread
From: Piotr Krukowiecki @ 2011-10-03 14:43 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Ingo Molnar, Junio C Hamano

Hi,

I'm having poor git gc (pack-object) performance. Please read below
for details. What can I do to improve the performance/debug the reason
for the slowness? Should I leave the process running over night, or
should I stop it (for debugging)?
CCing people who posted some patches/benchmarks for pack-objects recently.

git gc was first run automatically by git svn clone. It found 1544673
objects and worked for 50 minutes until I killed it.

Then I ran it by hand with --aggressive (because I found on the
Internet that it helped in some cases). It found 1742200 objects this
time. At this moment it's been working for about 90 minutes.

The large number of unpacked objects is probably caused by me - I've
disabled auto gc when I was cloning from svn (I thought it might speed
up things if it didn't repack several times during clone, only once
afterwards).

My git version is 1.7.7.rc3.4.g8d714
The file system is ext4.

First run process tree:
pkruk    27873  0.0  0.0  15704   816 pts/2    S+   11:53   0:00
       |           |                       \_ git gc --auto
pkruk    27885  0.0  0.0  15704   776 pts/2    S+   11:53   0:00
       |           |                           \_ git repack -d -l
pkruk    27886  0.0  0.0   4220   608 pts/2    S+   11:53   0:00
       |           |                               \_ /bin/sh
/usr/local/stow/git-master/libexec/git-core/git-repack -d -l
pkruk    27897  3.6  9.3 1136072 381148 pts/2  D+   11:53   5:51
       |           |                                   \_ git
pack-objects --keep-true-parents --honor-pack-keep --non-empty --all
--reflog --unpacked --incremental --local --delta-base-offset
/home/pkruk/dv/devel1_git_repos/.git/objects/pack/.tmp-27886-pack

Second run process tree:
pkruk     6171  0.0  0.0  15704  1428 pts/2    S+   14:34   0:00
       |           |               \_ git gc --aggressive
pkruk     6174  0.0  0.0  15704  1356 pts/2    S+   14:34   0:00
       |           |                   \_ git repack -d -l -f
--depth=250 --window=250 -A
pkruk     6175  0.0  0.0   4220   648 pts/2    S+   14:34   0:00
       |           |                       \_ /bin/sh
/usr/local/stow/git-master/libexec/git-core/git-repack -d -l -f
--depth=250 --window=250 -A
pkruk     6189  4.9 10.5 1143640 427396 pts/2  D+   14:34   4:50
       |           |                           \_ git pack-objects
--keep-true-parents --honor-pack-keep --non-empty --all --reflog
--unpack-unreachable --local --no-reuse-delta --depth=250 --window=250
--delta-base-offset
/home/pkruk/dv/devel1_git_repos/.git/objects/pack/.tmp-6175-pack



-- 
Piotr Krukowiecki


* Re: pack-object poor performance (with large number of objects?)
  2011-10-03 14:43 pack-object poor performance (with large number of objects?) Piotr Krukowiecki
@ 2011-10-03 16:05 ` Shawn Pearce
  2011-10-03 17:17   ` Piotr Krukowiecki
  0 siblings, 1 reply; 11+ messages in thread
From: Shawn Pearce @ 2011-10-03 16:05 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Git Mailing List, Ingo Molnar, Junio C Hamano

On Mon, Oct 3, 2011 at 07:43, Piotr Krukowiecki
<piotr.krukowiecki@gmail.com> wrote:
> I'm having poor git gc (pack-object) performance. Please read below
> for details. What can I do to improve the performance/debug the reason
> for the slowness? Should I leave the process running over night, or
> should I stop it (for debugging)?
> CCing people who posted some patches/benchmarks for pack-objects recently.
>
> git gc was first run automatically by git svn clone. It found 1544673
> objects and worked for 50 minutes until I killed it.
>
> Then I ran it by hand with --aggressive (because I found on the
> Internet that it helped in some cases). It found 1742200 objects this
> time. At this moment it's been working for about 90 minutes.

Packing time depends on a number of factors. One of them is the number
of unpacked objects to process. With 1.7 million objects, yes, it's
going to take some time. Another factor is how much RAM you have on
your system. Packing requires a lot of memory, especially with the
--aggressive flag, as the packer tries up to 250 different
combinations of two objects searching for a good delta compression
format, and all 250 of those are typically in-memory at once. If you
have insufficient physical RAM, the system will swap, unless you
decrease the window size.
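
As a minimal sketch of what decreasing the window could look like, run from inside the repository. The specific values (256m, 2 threads, a window and depth of 50) are illustrative assumptions, not recommendations made in this thread:

```shell
# Cap the memory each thread may use for the delta window (assumed value).
git config pack.windowMemory 256m
# Optionally limit the number of delta-search threads (assumed value).
git config pack.threads 2
# Repack from scratch with a smaller window than the 250 used by --aggressive.
git repack -a -d -f --window=50 --depth=50
```

With pack.windowMemory set, the window is shrunk dynamically when the candidate objects are large, so the limit mostly matters for repositories with big blobs.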

> The large number of unpacked objects is probably caused by me - I've
> disabled auto gc when I was cloning from svn (I thought it might speed
> up things if it didn't repack several times during clone, only once
> afterwards).

Yes, this is the reason `git svn` runs GC during its import. If you defer
all of the repacking work until the end, with everything loose, it can
take a very, very long time to repack. If you repack as you go, the
incremental repacks are less expensive than a full repack, and the
entire process will go faster overall.

-- 
Shawn.


* Re: pack-object poor performance (with large number of objects?)
  2011-10-03 16:05 ` Shawn Pearce
@ 2011-10-03 17:17   ` Piotr Krukowiecki
  2011-10-03 19:34     ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Piotr Krukowiecki @ 2011-10-03 17:17 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Git Mailing List, Ingo Molnar, Junio C Hamano

On Mon, Oct 3, 2011 at 6:05 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Oct 3, 2011 at 07:43, Piotr Krukowiecki
> <piotr.krukowiecki@gmail.com> wrote:
>> I'm having poor git gc (pack-object) performance. Please read below
>> for details. What can I do to improve the performance/debug the reason
>> for the slowness? Should I leave the process running over night, or
>> should I stop it (for debugging)?
>> CCing people who posted some patches/benchmarks for pack-objects recently.
>>
>> git gc was first run automatically by git svn clone. It found 1544673
>> objects and worked for 50 minutes until I killed it.
>>
>> Then I ran it by hand with --aggressive (because I found on the
>> Internet that it helped in some cases). It found 1742200 objects this
>> time. At this moment it's been working for about 90 minutes.
>
> Packing time depends on a number of factors. One of them is the number
> of unpacked objects to process. With 1.7 million objects, yes, it's
> going to take some time.

Are there any statistics on how long it should take?


> Another factor is how much RAM you have on
> your system. Packing requires a lot of memory, especially with the
> --aggressive flag, as the packer tries up to 250 different
> combinations of two objects searching for a good delta compression
> format, and all 250 of those are typically in-memory at once. If you
> have insufficient physical RAM, the system will swap, unless you
> decrease the window size.

I have 4GB of RAM and not all was used so it certainly shouldn't be
swapping. The process was in 'D' state so I suppose the hard disk
might be the limiting factor.
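
For reference, a small sketch of how the STAT column in the process listings can be read. The helper name describe_stat is hypothetical; the meanings follow ps(1). On a live system the value could come from something like `ps -o stat= -p PID`:

```shell
# Map the first letter of a ps STAT field to its meaning (per ps(1)).
describe_stat() {
    case "$1" in
        D*) echo "uninterruptible sleep (usually I/O)" ;;
        R*) echo "running or runnable" ;;
        S*) echo "interruptible sleep" ;;
        *)  echo "other" ;;
    esac
}

# STAT values taken from the process listings in this thread.
describe_stat "D+"
describe_stat "S+"
```

A process that is in 'D' nearly every time it is sampled is most likely waiting on the disk rather than on the CPU.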

I think I also disabled threading (I'll check tomorrow) - I suppose it
has an impact on packing time too.

I'll re-run packing tomorrow with threading and check the memory
usage, is there anything else I can do?

>
>> The large number of unpacked objects is probably caused by me - I've
>> disabled auto gc when I was cloning from svn (I thought it might speed
>> up things if it didn't repack several times during clone, only once
>> afterwards).
>
> Yes, this is the reason `git svn` runs GC during its import. If you defer
> all of the repacking work until the end, with everything loose, it can
> take a very, very long time to repack. If you repack as you go, the
> incremental repacks are less expensive than a full repack, and the
> entire process will go faster overall.

I've learned the lesson :)

-- 
Piotr Krukowiecki


* Re: pack-object poor performance (with large number of objects?)
  2011-10-03 17:17   ` Piotr Krukowiecki
@ 2011-10-03 19:34     ` Junio C Hamano
  2011-10-04  7:59       ` Piotr Krukowiecki
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2011-10-03 19:34 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Shawn Pearce, Git Mailing List, Ingo Molnar

Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes:

>> Packing time depends on a number of factors. One of them is the number
>> of unpacked objects to process. With 1.7 million objects, yes, it's
>> going to take some time.
>
> Are there any statistics on how long it should take?

Packing time depends on the repository, your machine and how you pack, so
such statistics would be useful only in comparable contexts.

    linux-3.0/master$ time git repack -a -d 
    Counting objects: 2138578, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (327257/327257), done.
    Writing objects: 100% (2138578/2138578), done.
    Total 2138578 (delta 1791983), reused 2138009 (delta 1791434)

    real    1m40.528s
    user    1m22.805s
    sys     0m3.788s
    linux-3.0/master$ git count-objects -v
    count: 0
    size: 0
    in-pack: 2138578
    packs: 1
    size-pack: 487957
    prune-packable: 0
    garbage: 0

This is on my box [*1*] that is idle (other than running the repack). The
above is starting from an already reasonably well packed state and reuses
deltas; with "-f" to repack everything from scratch it would take
significantly longer:

    linux-3.0/master$ time git repack -a -d -f
    Counting objects: 2138578, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (2118691/2118691), done.
    Writing objects: 100% (2138578/2138578), done.
    Total 2138578 (delta 1749156), reused 344219 (delta 0)

    real    3m26.750s
    user    8m41.857s
    sys     0m6.716s

Larger "window" tends to make the process take longer (I think it grows
quadratically) but may reduce both the resulting pack size and runtime
access overhead. Larger "depth" does not affect time to pack and helps
reduce the resulting pack size, at the cost of increased runtime access
overhead (i.e. not really recommended).

[Footnote]

*1* http://gitster.livejournal.com/34818.html

Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz with 8GB memory.


* Re: pack-object poor performance (with large number of objects?)
  2011-10-03 19:34     ` Junio C Hamano
@ 2011-10-04  7:59       ` Piotr Krukowiecki
  2011-10-04 11:07         ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Piotr Krukowiecki @ 2011-10-04  7:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, Git Mailing List, Ingo Molnar

On Mon, Oct 3, 2011 at 9:34 PM, Junio C Hamano <gitster@pobox.com> wrote:
> This is on my box [*1*] that is idle (other than running the repack). The
> above is starting from an already reasonably well packed state and reuses
> deltas; with "-f" to repack everything from scratch it would take
> significantly longer:
>
>    linux-3.0/master$ time git repack -a -d -f
>    Counting objects: 2138578, done.
>    Delta compression using up to 4 threads.
>    Compressing objects: 100% (2118691/2118691), done.
>    Writing objects: 100% (2138578/2138578), done.
>    Total 2138578 (delta 1749156), reused 344219 (delta 0)
>
>    real    3m26.750s
>    user    8m41.857s
>    sys     0m6.716s

I've run the command and it took about 20 minutes in "Counting
objects" to count up to 500000 on an idle machine, and there's still
700MB RAM free.

I wonder, when you do the repacking from a packed state, does it
physically create files on file system? In my case I have lots of
files in objects dir:

$ ls objects/ | wc -l
258
$ ls objects/00 | wc -l
6173

When I tried 'find objects | wc -l' previously (when repack was not
running) it got "stuck" too and I got impatient and killed it ;)

So it looks like it's not a problem with git but rather with my
disk/file system/linux...
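
A quick sanity check on those numbers, as a sketch. It assumes the 258 entries in objects/ are the 256 fan-out directories plus info and pack, and that objects/00 is a typical fan-out directory:

```shell
# 256 fan-out directories times the 6173 entries seen in objects/00.
fanout_dirs=256
per_dir=6173
echo $((fanout_dirs * per_dir))
```

That gives 1580288 loose objects, in the same range as the 1544673 objects the first gc run counted.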


-- 
Piotr Krukowiecki


* Re: pack-object poor performance (with large number of objects?)
  2011-10-04  7:59       ` Piotr Krukowiecki
@ 2011-10-04 11:07         ` Jeff King
  2011-10-04 12:22           ` Piotr Krukowiecki
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2011-10-04 11:07 UTC (permalink / raw)
  To: Piotr Krukowiecki
  Cc: Junio C Hamano, Shawn Pearce, Git Mailing List, Ingo Molnar

On Tue, Oct 04, 2011 at 09:59:08AM +0200, Piotr Krukowiecki wrote:

> I've run the command and it took about 20 minutes in "Counting
> objects" to count up to 500000 on an idle machine, and there's still
> 700MB RAM free.
> [...]
> So it looks like it's not a problem with git but rather with my
> disk/file system/linux...

You mentioned that git was in the 'D' state earlier. And it sounds like
you have 1.7 million objects, _completely_ unpacked.

So my guess is that it is simply taking an enormous amount of disk
space, and git is mostly waiting on the disk to read in files. What does
"du -sh .git/objects" say?

-Peff


* Re: pack-object poor performance (with large number of objects?)
  2011-10-04 11:07         ` Jeff King
@ 2011-10-04 12:22           ` Piotr Krukowiecki
  2011-10-04 12:45             ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Piotr Krukowiecki @ 2011-10-04 12:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Shawn Pearce, Git Mailing List, Ingo Molnar

On Tue, Oct 4, 2011 at 1:07 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Oct 04, 2011 at 09:59:08AM +0200, Piotr Krukowiecki wrote:
>
>> I've run the command and it took about 20 minutes in "Counting
>> objects" to count up to 500000 on an idle machine, and there's still
>> 700MB RAM free.
>> [...]
>> So it looks like it's not a problem with git but rather with my
>> disk/file system/linux...
>
> You mentioned that git was in the 'D' state earlier. And it sounds like
> you have 1.7 million objects, _completely_ unpacked.

That's right - since I had auto-gc disabled at first it had no chance
to pack anything.


> So my guess is that it is simply taking an enormous amount of disk
> space, and git is mostly waiting on the disk to read in files. What does
> "du -sh .git/objects" say?

It isn't that big - it's 11G.
.git/objects/pack/ is 666MB currently.


-- 
Piotr Krukowiecki


* Re: pack-object poor performance (with large number of objects?)
  2011-10-04 12:22           ` Piotr Krukowiecki
@ 2011-10-04 12:45             ` Jeff King
  2011-10-04 13:21               ` Piotr Krukowiecki
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2011-10-04 12:45 UTC (permalink / raw)
  To: Piotr Krukowiecki
  Cc: Junio C Hamano, Shawn Pearce, Git Mailing List, Ingo Molnar

On Tue, Oct 04, 2011 at 02:22:55PM +0200, Piotr Krukowiecki wrote:

> > So my guess is that it is simply taking an enormous amount of disk
> > space, and git is mostly waiting on the disk to read in files. What does
> > "du -sh .git/objects" say?
> 
> It isn't that big - it's 11G.
> .git/objects/pack/ is 666MB currently.

But you have 4G of RAM, no? So depending on the access patterns, you are
thrashing your disk cache and always pulling each object straight from
disk.

-Peff


* Re: pack-object poor performance (with large number of objects?)
  2011-10-04 12:45             ` Jeff King
@ 2011-10-04 13:21               ` Piotr Krukowiecki
  2011-10-04 18:08                 ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Piotr Krukowiecki @ 2011-10-04 13:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Shawn Pearce, Git Mailing List, Ingo Molnar

On Tue, Oct 4, 2011 at 2:45 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Oct 04, 2011 at 02:22:55PM +0200, Piotr Krukowiecki wrote:
>
>> > So my guess is that it is simply taking an enormous amount of disk
>> > space, and git is mostly waiting on the disk to read in files. What does
>> > "du -sh .git/objects" say?
>>
>> It isn't that big - it's 11G.
>> .git/objects/pack/ is 666MB currently.
>
> But you have 4G of RAM, no? So depending on the access patterns, you are
> thrashing your disk cache and always pulling each object straight from
> disk.

I have 4GB ram + 4GB swap. Is it possible the RAM is the problem if I
always have free RAM left and my swap is almost not used?
For example at the moment repack finished counting objects ("Counting
objects: 1742200, done."):

$ free -m
             total       used       free     shared    buffers     cached
Mem:          3960       3814        146          0        441        215
-/+ buffers/cache:       3157        803
Swap:         6143        694       5449

$ ps auxwwww | grep git
pkruk    13541  0.0  0.0  15704   716 pts/2    S+   13:19   0:00 git
repack -a -d -f
pkruk    13542  0.0  0.0   4220   540 pts/2    S+   13:19   0:00
/bin/sh /usr/local/stow/git-master/libexec/git-core/git-repack -a -d
-f
pkruk    13556  3.9  9.8 1143628 401232 pts/2  DN+  13:19   4:25 git
pack-objects --keep-true-parents --honor-pack-keep --non-empty --all
--reflog --no-reuse-delta --delta-base-offset
/home/pkruk/dv/devel1_git_repos/.git/objects/pack/.tmp-13542-pack


I have updated to 1.7.7 btw.

-- 
Piotr Krukowiecki


* Re: pack-object poor performance (with large number of objects?)
  2011-10-04 13:21               ` Piotr Krukowiecki
@ 2011-10-04 18:08                 ` Jeff King
  2011-10-05  8:48                   ` Piotr Krukowiecki
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2011-10-04 18:08 UTC (permalink / raw)
  To: Piotr Krukowiecki
  Cc: Junio C Hamano, Shawn Pearce, Git Mailing List, Ingo Molnar

On Tue, Oct 04, 2011 at 03:21:24PM +0200, Piotr Krukowiecki wrote:

> I have 4GB ram + 4GB swap. Is it possible the RAM is the problem if I
> always have free RAM left and my swap is almost not used?
> For example at the moment repack finished counting objects ("Counting
> objects: 1742200, done."):
> 
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:          3960       3814        146          0        441        215
> -/+ buffers/cache:       3157        803
> Swap:         6143        694       5449

I am not the best person to comment on Linux's disk caching strategies,
but in general, it should prefer dropping disk cache over pushing
program memory into swap. So no, you're not swapping, but you are
working with only 800M or so to do your disk caching.

So depending how big pack-object's working set of objects is, we might
be overflowing that, and constantly evicting and re-reading objects. I
don't recall offhand what kind of locality there is to pack-object's
accesses.
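
To put rough numbers on that, here is a sketch that works only from the "free -m" sample quoted above and the 11G figure reported by du earlier in the thread. Treating free + buffers + cached as the memory usable for disk caching is a simplifying assumption:

```shell
# Sample taken from the "free -m" output quoted above (values in MB).
free_output='             total       used       free     shared    buffers     cached
Mem:          3960       3814        146          0        441        215
-/+ buffers/cache:       3157        803
Swap:         6143        694       5449'

# Memory usable as disk cache ~= free + buffers + cached on the Mem: line.
cache_mb=$(printf '%s\n' "$free_output" | awk '/^Mem:/ { print $4 + $6 + $7 }')
echo "$cache_mb"

# Compare with the 11G object directory reported by du (11 * 1024 MB).
awk -v cache="$cache_mb" 'BEGIN { printf "%.1f\n", 11 * 1024 / cache }'
```

That works out to about 802 MB of cache against loose objects roughly 14 times that size, so very little of the object store can stay cached at once.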

One thing you could try to reduce the working set is to incrementally
pack some smaller chunks, and then combine them all at the end. That
ends up being more work overall, but at any given time, your working set
of objects will be smaller.

You'd have to do something like this (this is very untested):

  # find out how many revisions we have. Let's pretend it's about
  # 25,000.
  git rev-list HEAD | wc -l

  # now split them into chunks of whatever size you feel like trying.
  # 1000, maybe, or a few thousand. Bearing in mind that this is a gross
  # approximation, since the history is not linear.
  #
  # Start with HEAD~24K (25K total, minus 1K we want to pack)
  echo HEAD~24000 | git pack-objects --revs .git/objects/pack/pack
  # And then prune the loose objects that we just packed.
  git prune-packed
  # And repeat for the next chunk
  echo HEAD~24000..HEAD~23000 | git pack-objects --revs .git/objects/pack/pack
  git prune-packed
  # And so forth...

And then at the end, probably do a "git repack -ad" to put it all in
one big pack. Which should hopefully be less disk-intensive, because now
you'll have a much smaller disk footprint, since most of your objects
are at least delta'd against the others in their own pack.
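
The repetition above could be scripted along these lines. This is an equally untested sketch: chunk_ranges is a hypothetical helper, the 25000/1000 figures are the pretend numbers from above, and the loop only prints the commands (a dry run) rather than executing them. HEAD~N follows first parents only, so the same gross-approximation caveat applies:

```shell
# Hypothetical helper: print one revision range per chunk, oldest first.
# $1 = approximate commit count, $2 = commits per chunk.
chunk_ranges() {
    total=$1
    step=$2
    prev=""
    n=$((total - step))
    while [ "$n" -gt 0 ]; do
        if [ -z "$prev" ]; then
            echo "HEAD~$n"
        else
            echo "HEAD~$prev..HEAD~$n"
        fi
        prev=$n
        n=$((n - step))
    done
    # Final chunk reaches HEAD itself.
    if [ -z "$prev" ]; then
        echo "HEAD"
    else
        echo "HEAD~$prev..HEAD"
    fi
}

# Dry run: show the commands instead of executing them.
chunk_ranges 25000 1000 | while read -r range; do
    echo "echo $range | git pack-objects --revs .git/objects/pack/pack && git prune-packed"
done
```

Dropping the outer echo in the loop body would run the commands for real; it is probably worth trying that on a copy of the repository first.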

I have no idea if this will actually go faster for you. But it might be
worth trying, instead of just redoing the svn import with auto-gc turned
on.

-Peff


* Re: pack-object poor performance (with large number of objects?)
  2011-10-04 18:08                 ` Jeff King
@ 2011-10-05  8:48                   ` Piotr Krukowiecki
  0 siblings, 0 replies; 11+ messages in thread
From: Piotr Krukowiecki @ 2011-10-05  8:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Shawn Pearce, Git Mailing List, Ingo Molnar

On Tue, Oct 4, 2011 at 8:08 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Oct 04, 2011 at 03:21:24PM +0200, Piotr Krukowiecki wrote:
>
>> I have 4GB ram + 4GB swap. Is it possible the RAM is the problem if I
>> always have free RAM left and my swap is almost not used?
>> For example at the moment repack finished counting objects ("Counting
>> objects: 1742200, done."):
>>
>> $ free -m
>>              total       used       free     shared    buffers     cached
>> Mem:          3960       3814        146          0        441        215
>> -/+ buffers/cache:       3157        803
>> Swap:         6143        694       5449
>
[...]
> I have no idea if this will actually go faster for you. But it might be
> worth trying, instead of just redoing the svn import with auto-gc turned
> on.

I've left it to run over night and it finished (took almost 12 hours),
so hopefully I'm not going to run into this problem anymore.

$ time git repack -a -d -f
Counting objects: 1742200, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (1291909/1291909), done.
Writing objects: 100% (1742200/1742200), done.
Total 1742200 (delta 1094325), reused 39192 (delta 0)
Removing duplicate objects: 100% (256/256), done.

real	704m3.477s
user	65m35.960s
sys	9m50.880s

$ du -sh .git/objects/pack
3.9G	.git/objects/pack

$ git count-objects -v
count: 0
size: 0
in-pack: 1742200
packs: 1
size-pack: 4078245
prune-packable: 0
garbage: 0


-- 
Piotr Krukowiecki
