* pack operation is thrashing my server
From: Ken Pratt @ 2008-08-10 19:47 UTC (permalink / raw)
To: git
Hi,
I'm having memory issues when trying to clone a remote git repository.
I'm running: "git clone git+ssh://user@foo.bar.com/var/git/foo"
The remote repository is bare, and is 180MB in size (says du), with
1824 objects. The remote (VPS) server is running git version 1.5.6.4
on Arch Linux on a x86_64 Opteron with 256MB of dedicated RAM.
The clone command fires off some packing operations that bring the
server to its knees:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21782 kenpratt 20 0 444m 212m 272 D 3 83.0 0:04.98 git-pack-object
The clone also seems to hang forever. Progress stays at 0% for hours,
and it never progresses past compressing the first object.
I've tried very conservative pack settings:
[pack]
threads = 1
windowmemory = 64M
deltacachesize = 1M
deltacachelimit = 1M
[pack]
threads = 1
windowmemory = 16M
deltacachesize = 16M
deltacachelimit = 0
I've tried many variations like those, but nothing seems to help.
A "git repack -a -d" only takes 5 seconds to run on the same
repository on my laptop (a non-bare copy), and seems to peak at ~160MB
of RAM usage.
Any tips/help would be greatly appreciated. This repository is still
small -- it will eventually grow to multiple GB in size, as it is a
mix of small text files and binaries ranging in size from 2MB to
200MB. Is it not feasible to clone repositories of that size that are
hosted on a server with 256MB of RAM?
Thanks!
Ken
* Re: pack operation is thrashing my server
From: Martin Langhoff @ 2008-08-10 23:06 UTC (permalink / raw)
To: Ken Pratt; +Cc: git
On Mon, Aug 11, 2008 at 7:47 AM, Ken Pratt <ken@kenpratt.net> wrote:
> A "git repack -a -d" only takes 5 seconds to run on the same
> repository on my laptop (a non-bare copy), and seems to peak at ~160MB
> of RAM usage.
As a workaround, if you repack on your laptop and rsync the pack+index
to the server, it will work. This can be used to serve huge projects
out of lightweight-ish servers. Yet another workaround is to perform
initial clones via rsync or http.
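For illustration, a rough sketch of that workaround (the paths are
assumptions based on your original mail, and the laptop copy is assumed
to be a full clone of the same repository):
  # on the laptop: collapse everything into a single pack
  cd ~/foo
  git repack -a -d -f
  # copy the pack and its index into the bare repository on the server
  rsync -av .git/objects/pack/pack-*.pack .git/objects/pack/pack-*.idx \
      user@foo.bar.com:/var/git/foo/objects/pack/
  # on the server: drop loose objects now duplicated inside the pack
  ssh user@foo.bar.com 'cd /var/git/foo && git prune-packed'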
In your case, I agree that the repo doesn't seem large enough (or to
have large enough objects) to warrant having this problem. But I
can't help much with that myself - pack-machinery experts probably can.
cheers,
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
* Re: pack operation is thrashing my server
From: Ken Pratt @ 2008-08-10 23:12 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
Thanks for the tips, Martin.
How does git over rsync work? Is it unauthenticated, like git over
http? Or authenticated, like git+ssh?
Great ideas though. Unfortunately I don't think I'll be able to use
the repack locally and then upload strategy for this particular
workflow, but the rsync clone approach might do it.
-Ken
On Sun, Aug 10, 2008 at 4:06 PM, Martin Langhoff
<martin.langhoff@gmail.com> wrote:
> On Mon, Aug 11, 2008 at 7:47 AM, Ken Pratt <ken@kenpratt.net> wrote:
>> A "git repack -a -d" only takes 5 seconds to run on the same
>> repository on my laptop (a non-bare copy), and seems to peak at ~160MB
>> of RAM usage.
>
> As a workaround, if you repack on your laptop and rsync the pack+index
> to the server, it will work. This can be used to serve huge projects
> out of lightweight-ish servers. Yet another workaround is to perform
> initial clones via rsync or http.
>
> In your case, I agree that the repo doesn't seem large enough (or to
> have large enough objects) to warrant having this problem. But I
> can't help much with that myself - pack-machinery experts probably can.
>
> cheers,
>
>
> m
> --
> martin.langhoff@gmail.com
> martin@laptop.org -- School Server Architect
> - ask interesting questions
> - don't get distracted with shiny stuff - working code first
> - http://wiki.laptop.org/go/User:Martinlanghoff
>
--
Ken Pratt
http://kenpratt.net/
* Re: pack operation is thrashing my server
From: Martin Langhoff @ 2008-08-10 23:30 UTC (permalink / raw)
To: Ken Pratt; +Cc: git
On Mon, Aug 11, 2008 at 11:12 AM, Ken Pratt <ken@kenpratt.net> wrote:
> Thanks for the tips, Martin.
NP! :-)
> How does git over rsync work? Is it unauthenticated, like git over
> http? Or authenticated, like git+ssh?
I've always used it as rsync+ssh. Not sure about bare rsync.
> Great ideas though. Unfortunately I don't think I'll be able to use
> the repack locally and then upload strategy for this particular
> workflow, but the rsync clone approach might do it.
A few specific versions of git had bad repack cpu/memory usage
patterns, so an update to git might help. In any case, the repack
machinery experts are probably asleep. Give it a bit of time and
smarter answers will probably materialise.
cheers,
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
* Re: pack operation is thrashing my server
From: Ken Pratt @ 2008-08-10 23:34 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
Sounds good.
>> How does git over rsync work? Is it unauthenticated, like git over
>> http? Or authenticated, like git+ssh?
>
> I've always used it as rsync+ssh. Not sure about bare rsync.
Do you use file-level rsync+ssh? Or rsync+ssh with git?
When I try a "git clone rsync+ssh://foo.bar.com/var/git/bar", I get a
"fatal: I don't handle protocol 'rsync+ssh'" error.
I know git supports the rsync protocol, but I don't think installing
an rsync server and using bare rsync will be an option in this case.
Thanks again,
Ken
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-11 3:04 UTC (permalink / raw)
To: Ken Pratt; +Cc: git
Ken Pratt <ken@kenpratt.net> wrote:
> I'm having memory issues when trying to clone a remote git repository.
>
> The remote repository is bare, and is 180MB in size (says du), with
> 1824 objects. The remote (VPS) server is running git version 1.5.6.4
> on Arch Linux on a x86_64 Opteron with 256MB of dedicated RAM.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 21782 kenpratt 20 0 444m 212m 272 D 3 83.0 0:04.98 git-pack-object
Well, clearly the server is swapping at this point. 212m resident
for this git-pack-objects process leaves no room available for
anything else. Git is using too much memory for this system.
> I've tried very conservative pack settings:
>
> [pack]
> threads = 1
> windowmemory = 64M
> deltacachesize = 1M
> deltacachelimit = 1M
Have you tried something like this?
[core]
packedGitWindowSize = 16m
packedGitLimit = 64m
[pack]
threads = 1
windowMemory = 64m
deltaCacheSize = 1m
On a 64 bit system packedGitWindowSize and packedGitLimit have very
large thresholds which will cause it to mmap in the entire pack file.
You may need to try even smaller settings than these; 256m physical
memory isn't a lot when dealing with a repository 180m in size.
Especially on a 64 bit system.
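If it helps, a sketch of setting those directly on the bare repository
on the server (the path is taken from the original mail; the values are
the ones suggested above):
  cd /var/git/foo
  git config core.packedGitWindowSize 16m
  git config core.packedGitLimit 64m
  git config pack.threads 1
  git config pack.windowMemory 64m
  git config pack.deltaCacheSize 1m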
--
Shawn.
* Re: pack operation is thrashing my server
From: Ken Pratt @ 2008-08-11 7:43 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git
> Have you tried something like this?
>
> [core]
> packedGitWindowSize = 16m
> packedGitLimit = 64m
>
> [pack]
> threads = 1
> windowMemory = 64m
> deltaCacheSize = 1m
>
> On a 64 bit system packedGitWindowSize and packedGitLimit have very
> large thresholds which will cause it to mmap in the entire pack file.
> You may need to try even smaller settings than these; 256m physical
> memory isn't a lot when dealing with a repository 180m in size.
> Especially on a 64 bit system.
I just went as low as:
[core]
packedGitWindowSize = 1m
packedGitLimit = 4m
[pack]
threads = 1
windowMemory = 4m
deltaCacheSize = 128k
And it didn't make a dent in memory usage. Server is still swapping
within ~10 seconds of starting object compression.
I'm starting to think repacking is just not feasible on a 64-bit
server with 256MB of RAM (which is a very popular configuration in the
VPS market).
Thanks!
Ken
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-11 15:01 UTC (permalink / raw)
To: Ken Pratt; +Cc: git
Ken Pratt <ken@kenpratt.net> wrote:
> I just went as low as:
>
> [core]
> packedGitWindowSize = 1m
> packedGitLimit = 4m
> [pack]
> threads = 1
> windowMemory = 4m
> deltaCacheSize = 128k
>
> And it didn't make a dent in memory usage. Server is still swapping
> within ~10 seconds of starting object compression.
>
> I'm starting to think repacking is just not feasible on a 64-bit
> server with 256MB of RAM (which is a very popular configuration in the
> VPS market).
What is the largest object in that repository? Do you have a
rough guess? You said earlier:
> The remote repository is bare, and is 180MB in size (says du), with
> 1824 objects.
That implies there is at least one really large object in that
repository. The average of 101KB per object is not going to be
a correct figure here as most commits and trees are _very_ tiny.
It must be a large object. Those big objects are going to consume
a lot of memory if they get inflated in memory.
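For what it's worth, a rough way to answer that on the server (a sketch,
run inside the bare repository; GNU find/ls are assumed, and the third
column of verify-pack -v output is the object size):
  cd /var/git/foo
  # loose objects bigger than 1MB, with their on-disk sizes
  find objects/?? -type f -size +1M -exec ls -lhS {} +
  # the five largest packed objects, if any packs exist
  git verify-pack -v objects/pack/pack-*.idx 2>/dev/null | sort -k 3 -n | tail -5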
You may very well be right that this particular repository of
yours is simply not packable on a 64 bit system with only 256M.
Packing takes a good chunk of memory as we maintain data about
every single object, plus we need working space to unpack several
objects at once so we can perform diffs to find deltas.
I'm not sure there are any more tunables you can try to tweak to
reduce the memory usage further. The configuration above is pushed
down about as low as it will go. For the most part the code is
pretty good about not exploding memory usage.
You said earlier this was Git 1.5.6.4. I recently fixed a bug in
the code that reads data from packs to prevent it from blowing out
memory usage, but that bug fix was included in 1.5.6.4.
On the up side, packing should only be consuming huge memory like
this when it needs to move loose objects into a pack file. I think
Martin Langhoff suggested packing this on your laptop then using
rsync over SSH to copy the pack file and .idx file to the server, so
the server didn't have to spend time figuring out the deltas itself.
Even though the clone command will fire off git-pack-objects the
pack-objects command will have a lot less work to do if the data
it needs is already stored in existing pack files.
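One quick way to see how much of the server repository is still loose
(a sketch, run inside the bare repository):
  git count-objects -v
  # "count"/"size" describe loose objects; "in-pack" is what is already packed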
--
Shawn.
* Re: pack operation is thrashing my server
From: Avery Pennarun @ 2008-08-11 15:40 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Ken Pratt, git
On Mon, Aug 11, 2008 at 11:01 AM, Shawn O. Pearce <spearce@spearce.org> wrote:
> On the up side, packing should only be consuming huge memory like
> this when it needs to move loose objects into a pack file. I think
> Martin Langhoff suggested packing this on your laptop then using
> rsync over SSH to copy the pack file and .idx file to the server, so
> the server didn't have to spend time figuring out the deltas itself.
Do you need to also introduce a ".keep" file to get the benefit from
this? I had a repo with some very large objects, and it was killing
my low-memory server *every* time I did "git gc", until I repacked on
another system, created the .keep file, and rsynced it back. Does
that make sense?
Thanks,
Avery
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-11 15:59 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Ken Pratt, git
Avery Pennarun <apenwarr@gmail.com> wrote:
> On Mon, Aug 11, 2008 at 11:01 AM, Shawn O. Pearce <spearce@spearce.org> wrote:
> > On the up side, packing should only be consuming huge memory like
> > this when it needs to move loose objects into a pack file. I think
> > Martin Langhoff suggested packing this on your laptop then using
> > rsync over SSH to copy the pack file and .idx file to the server, so
> > the server didn't have to spend time figuring out the deltas itself.
>
> Do you need to also introduce a ".keep" file to get the benefit from
> this? I had a repo with some very large objects, and it was killing
> my low-memory server *every* time I did "git gc", until I repacked on
> another system, created the .keep file, and rsynced it back. Does
> that make sense?
No, the ".keep" file wouldn't have an impact. Delta reuse (the
feature I was alluding to) works whether or not there is a .keep
file present.
I wonder if your "git gc" was using --aggressive?
--
Shawn.
* Re: pack operation is thrashing my server
From: Andi Kleen @ 2008-08-11 19:10 UTC (permalink / raw)
To: Ken Pratt; +Cc: Shawn O. Pearce, git
"Ken Pratt" <ken@kenpratt.net> writes:
>
> I'm starting to think repacking is just not feasible on a 64-bit
> server with 256MB of RAM (which is a very popular configuration in the
> VPS market).
As a quick workaround you could try it with a 32bit git executable?
(assuming you have a distribution with proper multilib support)
I think the right fix would be to make git throttle itself (not
use mmap, use very small defaults etc.) on low memory systems.
It could take a look at /proc/meminfo for this.
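Something along those lines can already be approximated from the outside
today; a rough sketch (Linux-only, run inside the repository, and the
512MB threshold is an arbitrary choice):
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  if [ "$mem_kb" -lt 524288 ]; then   # less than ~512MB of RAM
      git config pack.threads 1
      git config pack.windowMemory 16m
      git config core.packedGitWindowSize 16m
      git config core.packedGitLimit 64m
  fi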
-Andi
* Re: pack operation is thrashing my server
From: Ken Pratt @ 2008-08-11 19:13 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git
> What is the largest object in that repository? Do you have a
> rough guess? You said earlier:
>
>> The remote repository is bare, and is 180MB in size (says du), with
>> 1824 objects.
>
> That implies there is at least one really large object in that
> repository. The average of 101KB per object is not going to be
> a correct figure here as most commits and trees are _very_ tiny.
> It must be a large object. Those big objects are going to consume
> a lot of memory if they get inflated in memory.
Largest object is ~150MB, and there are a couple 5-10MB objects as well.
> You said earlier this was Git 1.5.6.4. I recently fixed a bug in
> the code that reads data from packs to prevent it from blowing out
> memory usage, but that bug fix was included in 1.5.6.4.
I tried upgrading to 1.5.6.5 as well, but that didn't help.
> On the up side, packing should only be consuming huge memory like
> this when it needs to move loose objects into a pack file. I think
> Martin Langhoff suggested packing this on your laptop then using
> rsync over SSH to copy the pack file and .idx file to the server, so
> the server didn't have to spend time figuring out the deltas itself.
Unfortunately, that will only work as a band-aid solution for my
workflow. I think I'll have to limit the file size in the repository
to something that the server can handle.
* Re: pack operation is thrashing my server
From: Ken Pratt @ 2008-08-11 19:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: Shawn O. Pearce, git
> As a quick workaround you could try it with a 32bit git executable?
> (assuming you have a distribution with proper multilib support)
In this case, I do have control over the server (running Arch Linux,
which should do 32-bit multilib just fine), but for my workflow I
cannot assume that the server will have 32-bit git support.
I will use the previously mentioned solution of doing the packing
elsewhere for now as a band-aid, with hopes that this will get fixed
sometime soon.
Thanks!
-Ken
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-11 19:22 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ken Pratt, git
Andi Kleen <andi@firstfloor.org> wrote:
> "Ken Pratt" <ken@kenpratt.net> writes:
> >
> > I'm starting to think repacking is just not feasible on a 64-bit
> > server with 256MB of RAM (which is a very popular configuration in the
> > VPS market).
>
> I think the right fix would be to make git throttle itself (not
> use mmap, use very small defaults etc.) on low memory systems.
> It could take a look at /proc/meminfo for this.
Well, we had thought it was already able to throttle itself, as
we did put code in to respond to mmap() and malloc() failures by
trying to release memory and retrying the failed operation again.
However what we don't do is try to limit our heap usage to some
limit that is smaller than physical memory. We just assume that
whatever we need is available from the OS. This fails when what
we need exceeds physical memory and the OS tries to use swap.
We can get better performance by reducing what we mmap instead.
:-|
Looking at /proc/meminfo only works on Linux, and maybe some other
OSes which support a /proc like design. But even then we don't
really know how much we are competing with other active processes
and how much memory we can use.
--
Shawn.
* Re: pack operation is thrashing my server
From: Ken Pratt @ 2008-08-11 19:29 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Andi Kleen, git
> Looking at /proc/meminfo only works on Linux, and maybe some other
> OSes which support a /proc like design. But even then we don't
> really know how much we are competing with other active processes
> and how much memory we can use.
Could we create a git config variable to specify the maximum amount of
memory to mmap? And if that variable wasn't explicitly set, it would
fall back on looking at /proc/meminfo?
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-11 19:34 UTC (permalink / raw)
To: Ken Pratt; +Cc: Andi Kleen, git
Ken Pratt <ken@kenpratt.net> wrote:
> > Looking at /proc/meminfo only works on Linux, and maybe some other
> > OSes which support a /proc like design. But even then we don't
> > really know how much we are competing with other active processes
> > and how much memory we can use.
>
> Could we create a git config variable to specify the maximum amount of
> memory to mmap? And if that variable wasn't explicitly set, it would
> fall back on looking at /proc/meminfo?
Well, core.packedGitLimit is supposed to be related to this limit
you are asking for. But it doesn't cover all memory usage as we
malloc other things. core.deltaBaseCacheLimit covers part of the
malloc'd area. pack.windowLimit I think covers another part of
the malloc'd area. Etc...
There really isn't a global "malloc/mmap at most X bytes".
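For reference, a sketch of the separate knobs being discussed, with my
rough reading of what each one bounds (not an exact accounting of all
allocations):
  [core]
      packedGitWindowSize = 16m   # size of each mmap window onto a pack
      packedGitLimit = 64m        # total pack data kept mapped at once
      deltaBaseCacheLimit = 16m   # cache of inflated delta base objects
  [pack]
      windowMemory = 16m          # memory for the delta search window
      deltaCacheSize = 1m         # cache of deltas computed while packing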
--
Shawn.
* Re: pack operation is thrashing my server
From: Andi Kleen @ 2008-08-11 20:10 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Ken Pratt, Andi Kleen, git
> There really isn't a global "malloc/mmap at most X bytes".
Sure, it can never be 100% accurate because other processes
can also steal memory.
Still, a 90+% heuristic can work pretty well. If memory < 512MB, then don't
use mmap, for example. If memory < 256MB, do everything as tightly
as possible. gcc uses such heuristics quite successfully.
The only problem might be testing coverage for such options.
It might be useful to add options to force it and then run
the test suite with it.
-Andi
* Re: pack operation is thrashing my server
From: Nicolas Pitre @ 2008-08-13 2:38 UTC (permalink / raw)
To: Ken Pratt; +Cc: Andi Kleen, Shawn O. Pearce, git
On Mon, 11 Aug 2008, Ken Pratt wrote:
> > As a quick workaround you could try it with a 32bit git executable?
> > (assuming you have a distribution with proper multilib support)
>
> In this case, I do have control over the server (running Arch Linux,
> which should do 32-bit multilib just fine), but for my workflow I
> cannot assume that the server will have 32-bit git support.
>
> I will use the previously mentioned solution of doing the packing
> elsewhere for now as a band-aid, with hopes that this will get fixed
> sometime soon.
I'm afraid no fix is "possible" since you said:
> Largest object is ~150MB, and there are a couple 5-10MB objects as
> well.
If you have only 256 MB of RAM, I'm afraid the machine dives into swap
the moment it attempts to process that single 150-MB object during
repacking. Objects are always allocated entirely, including the
deflated and inflated copy at some point. Making git handle partial
objects in memory would add complexity all over the map so I don't think
it'll ever be implemented nor be desirable.
If you do repack once with 'git repack -a -f -d' on a bigger machine
then 256 MB of RAM might be fine for serving clone and fetch requests
though.
Nicolas
* Re: pack operation is thrashing my server
From: Andi Kleen @ 2008-08-13 2:50 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Ken Pratt, Andi Kleen, Shawn O. Pearce, git
> If you have only 256 MB of RAM, I'm afraid the machine dives into swap
> the moment it attempts to process that single 150-MB object during
> repacking. Objects are always allocated entirely, including the
> deflated and inflated copy at some point. Making git handle partial
> objects in memory would add complexity all over the map so I don't think
> it'll ever be implemented nor be desirable.
If the access pattern is sequential with little reuse, it might be possible
to call madvise() strategically to prefetch data and unmap it early once it
is no longer needed. I used that successfully in a few programs in the past
that did aggressive mmap on very large files.
-Andi
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-13 2:57 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nicolas Pitre, Ken Pratt, git
Andi Kleen <andi@firstfloor.org> wrote:
> > If you have only 256 MB of RAM, I'm afraid the machine dives into swap
> > the moment it attempts to process that single 150-MB object during
> > repacking. Objects are always allocated entirely, including the
> > deflated and inflated copy at some point. Making git handle partial
> > objects in memory would add complexity all over the map so I don't think
> > it'll ever be implemented nor be desirable.
>
> If the access pattern is sequential and not much reuse it might be possible
> to madvise() strategically to do prefetch and early unmap of not used
> anymore data. I used that successfully in a few programs in the past that did
> aggressive mmap on very large files.
We actually do something better where we can. However parts of
Git assume that it can get back a contiguous block of memory which
contains the entire file content, decompressed. The data is stored
on disk compressed, so we cannot just mmap the data from disk.
--
Shawn.
* Re: pack operation is thrashing my server
From: Geert Bosch @ 2008-08-13 3:12 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ken Pratt, Shawn O. Pearce, git
On Aug 11, 2008, at 15:10, Andi Kleen wrote:
> As a quick workaround you could try it with a 32bit git executable?
> (assuming you have a distribution with proper multilib support)
>
> I think the right fix would be to make git throttle itself (not
> use mmap, use very small defaults etc.) on low memory systems.
> It could take a look at /proc/meminfo for this.
I've always felt that keeping largish objects (say anything >1MB)
loose makes perfect sense. These objects are accessed infrequently,
often binary or otherwise poor candidates for the delta algorithm.
Many repositories are mostly well-behaved, with a large number of text
files that aren't overly large and compress/diff well. However, often
a few huge files creep in. These might be 30 MB Word or PDF documents
(with lots of images of course), a bunch of artwork, some random .tgz
files with required tools, or otherwise.
Regardless of their origin, the presence of such files in real-world
SCMs is a given and can ruin performance, even if they're hardly ever
accessed or updated. If we left such oddball objects loose, the pack
would be much smaller, easier to generate, faster to use, and there
should be no memory usage issues.
-Geert
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-13 3:15 UTC (permalink / raw)
To: Geert Bosch; +Cc: Andi Kleen, Ken Pratt, git
Geert Bosch <bosch@adacore.com> wrote:
> I've always felt that keeping largish objects (say anything >1MB)
> loose makes perfect sense. These objects are accessed infrequently,
> often binary or otherwise poor candidates for the delta algorithm.
Sadly this causes huge problems with streaming a pack because the
loose object has to be inflated and then deflated again to fit into
the pack stream.
The new style loose object format was meant to fix this problem,
and it did, but the code was difficult to manage so it was backed
out of the tree.
--
Shawn.
* Re: pack operation is thrashing my server
From: Geert Bosch @ 2008-08-13 3:58 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Andi Kleen, Ken Pratt, git
On Aug 12, 2008, at 23:15, Shawn O. Pearce wrote:
> Geert Bosch <bosch@adacore.com> wrote:
>> I've always felt that keeping largish objects (say anything >1MB)
>> loose makes perfect sense. These objects are accessed infrequently,
>> often binary or otherwise poor candidates for the delta algorithm.
>
> Sadly this causes huge problems with streaming a pack because the
> loose object has to be inflated and then deflated again to fit into
> the pack stream.
Sure, but that really is not that much of an issue. For people
with large systems connected by very fast networks, the current
situation is probably fine, and spending a lot of effort for
packing often makes sense.
However, for a random repository of Joe User, all the effort spent
on packing will probably never be gained back. Most people just
suck content from upstream and at most maintain a couple of local
hacks on top of that. Little or nothing is ever pushed to other
systems.
Even when pushing to other systems, this often is just a handful of
objects through a slow line, and compression/decompression speeds just
don't matter much.
> The new style loose object format was meant to fix this problem,
> and it did, but the code was difficult to manage so it was backed
> out of the tree.
One nice optimization we could do for those pesky binary large objects
(like PDF, JPG and GZIP-ed data), is to detect such files and revert
to compression level 0. This should be especially beneficial
since already compressed data takes most time to compress again.
-Geert
* Re: pack operation is thrashing my server
From: Jakub Narebski @ 2008-08-13 12:43 UTC (permalink / raw)
To: Ken Pratt; +Cc: git
[...]
If I remember correctly there were some patches on the git mailing list
by Dana How which put an upper bound on the size of individual objects
going into a pack; objects with a size above the threshold would be left
as loose objects (and shared via a network drive).
Unfortunately, if I remember correctly, they were not accepted into git.
You can try to pack the large objects into a separate pack and .keep it,
or try to resurrect the patches from the git mailing list archive.
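A rough sketch of the separate-pack-plus-.keep idea (run inside the bare
repository; large-blob-ids.txt is a hypothetical file listing one object
SHA-1 per line, e.g. gathered with verify-pack):
  sha=$(git pack-objects objects/pack/pack-large < large-blob-ids.txt)
  # mark the new pack as kept so later repacks leave its objects alone
  touch "objects/pack/pack-large-$sha.keep"
  # drop loose copies of objects that are now in a pack
  git prune-packed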
HTH.
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: pack operation is thrashing my server
From: Nicolas Pitre @ 2008-08-13 14:35 UTC (permalink / raw)
To: Geert Bosch; +Cc: Andi Kleen, Ken Pratt, Shawn O. Pearce, git
On Tue, 12 Aug 2008, Geert Bosch wrote:
> I've always felt that keeping largish objects (say anything >1MB)
> loose makes perfect sense. These objects are accessed infrequently,
> often binary or otherwise poor candidates for the delta algorithm.
Or, as I suggested in the past, they can be grouped into a separate
pack, or even occupy a pack of their own. As soon as you have more than
one revision of such largish objects then you lose again by keeping them
loose.
> Many repositories are mostly well-behaved, with a large number of text
> files that aren't overly large and compress/diff well. However, often
> a few huge files creep in. These might be 30 MB Word or PDF documents
> (with lots of images of course), a bunch of artwork, some random .tgz files
> with required tools, or otherwise.
>
> Regardless of their origin, the presence of such files in real-world SCMs
> is a given and can ruin performance, even if they're hardly ever accessed
> or updated. If we left such oddball objects loose, the pack would
> be much smaller, easier to generate, faster to use, and there should be no
> memory usage issues.
You'll have memory usage issues whenever such objects are accessed,
loose or not. However, once those big objects are packed once, they can
be repacked (or streamed over the net) without really "accessing" them.
Packed object data is simply copied into a new pack in that case which
is less of an issue on memory usage, irrespective of the original pack
size.
Nicolas
* Re: pack operation is thrashing my server
From: Nicolas Pitre @ 2008-08-13 14:37 UTC (permalink / raw)
To: Geert Bosch; +Cc: Shawn O. Pearce, Andi Kleen, Ken Pratt, git
On Tue, 12 Aug 2008, Geert Bosch wrote:
> One nice optimization we could do for those pesky binary large objects
> (like PDF, JPG and GZIP-ed data), is to detect such files and revert
> to compression level 0. This should be especially beneficial
> since already compressed data takes most time to compress again.
That would be a good thing indeed.
Nicolas
* Re: pack operation is thrashing my server
From: Jakub Narebski @ 2008-08-13 14:56 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Geert Bosch, Shawn O. Pearce, Andi Kleen, Ken Pratt, git
Nicolas Pitre <nico@cam.org> writes:
> On Tue, 12 Aug 2008, Geert Bosch wrote:
>
> > One nice optimization we could do for those pesky binary large objects
> > (like PDF, JPG and GZIP-ed data), is to detect such files and revert
> > to compression level 0. This should be especially beneficial
> > since already compressed data takes most time to compress again.
>
> That would be a good thing indeed.
Perhaps take a sample of some given size and calculate entropy in it?
Or just simply add gitattribute for per file compression ratio...
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-13 14:59 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Geert Bosch, Andi Kleen, Ken Pratt, git
Nicolas Pitre <nico@cam.org> wrote:
> You'll have memory usage issues whenever such objects are accessed,
> loose or not. However, once those big objects are packed once, they can
> be repacked (or streamed over the net) without really "accessing" them.
> Packed object data is simply copied into a new pack in that case which
> is less of an issue on memory usage, irrespective of the original pack
> size.
And fortunately here we actually do stream the objects we have
chosen to reuse from the pack. We don't allocate the entire thing
in memory. It's probably the only place in all of Git where we can
handle a 16 GB (after compression) object on a machine with only
2 GB of memory and no swap.
Where little memory systems get into trouble with already packed
repositories is enumerating the objects to include in the pack.
This can still blow out their physical memory if the number of
objects to pack is high enough. We need something like 160 bytes
of memory (my own memory is fuzzy on that estimate) per object.
Have 500k objects and it's suddenly something quite real in terms
of memory usage.
--
Shawn.
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-13 15:04 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Nicolas Pitre, Geert Bosch, Andi Kleen, Ken Pratt, git
Jakub Narebski <jnareb@gmail.com> wrote:
> Nicolas Pitre <nico@cam.org> writes:
> > On Tue, 12 Aug 2008, Geert Bosch wrote:
> >
> > > One nice optimization we could do for those pesky binary large objects
> > > (like PDF, JPG and GZIP-ed data), is to detect such files and revert
> > > to compression level 0. This should be especially beneficial
> > > since already compressed data takes most time to compress again.
> >
> > That would be a good thing indeed.
>
> Perhaps take a sample of some given size and calculate entropy in it?
> Or just simply add gitattribute for per file compression ratio...
Estimating the entropy would make it "just magic". Most of Git is
"just magic" so that's a good direction to take. I'm not familiar
enough with the PDF/JPG/GZIP/ZIP stream formats to know what the
first 4-8k looks like to know if it would give a good indication
of being already compressed.
Though I'd imagine looking at the first 4k should be sufficient
for any compressed file. Having a header composed of 4k of _text_
before binary compressed data would be nuts. Or a git-bundle with
a large refs listing. ;-)
Using a gitattribute inside of pack-objects is not "simple".
We currently only support reading attributes from the working
directory if I recall correctly. pack-objects may not have a
working directory.
Hence, "just magic" is probably the better route.
--
Shawn.
* Re: pack operation is thrashing my server
From: David Tweed @ 2008-08-13 15:26 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Jakub Narebski, Nicolas Pitre, Geert Bosch, Andi Kleen, Ken Pratt,
git
On Wed, Aug 13, 2008 at 4:04 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> Nicolas Pitre <nico@cam.org> writes:
>> > On Tue, 12 Aug 2008, Geert Bosch wrote:
>> >
>> > > One nice optimization we could do for those pesky binary large objects
>> > > (like PDF, JPG and GZIP-ed data), is to detect such files and revert
>> > > to compression level 0. This should be especially beneficial
>> > > since already compressed data takes most time to compress again.
>> >
>> > That would be a good thing indeed.
>>
>> Perhaps take a sample of some given size and calculate entropy in it?
>> Or just simply add gitattribute for per file compression ratio...
>
> Estimating the entropy would make it "just magic". Most of Git is
> "just magic" so that's a good direction to take. I'm not familiar
> enough with the PDF/JPG/GZIP/ZIP stream formats to know what the
> first 4-8k looks like to know if it would give a good indication
> of being already compressed.
>
> Though I'd imagine looking at the first 4k should be sufficient
> for any compressed file. Having a header composed of 4k of _text_
> before binary compressed data would be nuts. Or a git-bundle with
> a large refs listing. ;-)
FWIW, PDF format is a mix of sections of uncompressed higher level
ASCII notation and sections of compressed actual glyph/location data
for individual pages, and I don't think the rules are very strict
about what goes where. Looking at some academic papers, some contain
compressed data within the first hundred characters, whilst I've got a
couple where the first compressed byte is at offset 1968 and 12304; I'm
sure if I had a longer PDF to look at I'd find one where compressed data
first occurred even later. I leave discussions of whether this is nuts to
others ;-) .
JPG is pretty much guaranteed to contain compressed data after a
couple of metadata lines.
--
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot
* Re: pack operation is thrashing my server
From: Nicolas Pitre @ 2008-08-13 15:43 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Geert Bosch, Andi Kleen, Ken Pratt, git
On Wed, 13 Aug 2008, Shawn O. Pearce wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > You'll have memory usage issues whenever such objects are accessed,
> > loose or not. However, once those big objects are packed once, they can
> > be repacked (or streamed over the net) without really "accessing" them.
> > Packed object data is simply copied into a new pack in that case which
> > is less of an issue on memory usage, irrespective of the original pack
> > size.
>
> And fortunately here we actually do stream the objects we have
> chosen to reuse from the pack. We don't allocate the entire thing
> in memory. It's probably the only place in all of Git where we can
> handle a 16 GB (after compression) object on a machine with only
> 2 GB of memory and no swap.
>
> Where little memory systems get into trouble with already packed
> repositories is enumerating the objects to include in the pack.
> This can still blow out their physical memory if the number of
> objects to pack is high enough. We need something like 160 bytes
> of memory (my own memory is fuzzy on that estimate) per object.
I'm counting something like 104 bytes on a 64-bit machine for
struct object_entry.
> Have 500k objects and it's suddenly something quite real in terms
> of memory usage.
Well, we are talking about 50MB which is not that bad.
However there is a point where we should be realistic and just admit
that you need a sufficiently big machine if you have huge repositories
to deal with. Git should be fine serving pull requests with relatively
little memory usage, but anything else, such as the initial repack, simply
requires enough RAM to be effective.
Nicolas
* Re: pack operation is thrashing my server
From: Shawn O. Pearce @ 2008-08-13 15:50 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Geert Bosch, Andi Kleen, Ken Pratt, git
Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 13 Aug 2008, Shawn O. Pearce wrote:
> >
> > Where little memory systems get into trouble with already packed
> > repositories is enumerating the objects to include in the pack.
>
> I'm counting something like 104 bytes on a 64-bit machine for
> struct object_entry.
Don't forget that we need not just struct object_entry, but
also the struct commit/tree/blob, their hash tables, and the
struct object_entry* in the sorted object list table, and
the pack reverse index table. It does add up.
> > Have 500k objects and it's suddenly something quite real in terms
> > of memory usage.
>
> Well, we are talking about 50MB which is not that bad.
I think we're closer to 100MB here due to the extra overheads
I just alluded to above, and which weren't in your 104 byte
per object figure.
> However there is a point where we should be realistic and just admit
> that you need a sufficiently big machine if you have huge repositories
> to deal with. Git should be fine serving pull requests with relatively
> little memory usage, but anything else, such as the initial repack, simply
> requires enough RAM to be effective.
Yea. But it would also be nice to be able to just concat packs
together. Especially if the repository in question is an open source
one and everything published is already known to be in the wild,
as say it is also available over dumb HTTP. Yea, I know people
like the 'security feature' of the packer not including objects
which aren't reachable.
But how many times has Linus published something to his linux-2.6
tree that he didn't mean to publish and had to rewind? I think
that may be "never". Yet how many times per day does his tree get
cloned from scratch?
This is also true for many internal corporate repositories.
Users probably have full read access to the object database anyway,
and maybe even have direct write access to it. Doing the object
enumeration there is pointless as a security measure.
I'm too busy to write a pack concat implementation proposal, so
I'll just shut up now. But it wouldn't be hard if someone wanted
to improve at least the initial clone serving case.
--
Shawn.
* Re: pack operation is thrashing my server
From: Geert Bosch @ 2008-08-13 16:01 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Andi Kleen, Ken Pratt, Shawn O. Pearce, git
On Aug 13, 2008, at 10:35, Nicolas Pitre wrote:
> On Tue, 12 Aug 2008, Geert Bosch wrote:
>
>> I've always felt that keeping largish objects (say anything >1MB)
>> loose makes perfect sense. These objects are accessed infrequently,
>> often binary or otherwise poor candidates for the delta algorithm.
>
> Or, as I suggested in the past, they can be grouped into a separate
> pack, or even occupy a pack of their own.
This is fine, as long as we're not trying to create deltas
of the large objects, or do other things that require keeping
the inflated data in memory.
> As soon as you have more than
> one revision of such largish objects then you lose again by keeping
> them
> loose.
Yes, you lose potentially in terms of disk space, but you avoid the
large memory footprint during pack generation. For very large blobs,
it is best to degenerate to having each revision of each file on
its own (whether we call it a single-file pack, loose object or
whatever).
That way, the large file can stay immutable on disk, and will only
need to be accessed during checkout. GIT will then scale with good
performance until we run out of disk space.
The alternative is that people need to keep large binary data out
of their SCMs and handle it on the side. Consider a large web site
where I have all scripts, HTML content, as well as a few movies
to manage. The movies basically should be copied and stored, only
to be accessed when a checkout (or push) is requested.
If we mix the very large movies with the 100,000 objects representing
the webpages, the resulting pack will become unwieldy and slow even
to just copy around during repacks.
> You'll have memory usage issues whenever such objects are accessed,
> loose or not.
Why? The only time we'd need to access their contents is for checkout
or when pushing across the network. These should all be streaming
operations with a small memory footprint.
> However, once those big objects are packed once, they can
> be repacked (or streamed over the net) without really "accessing"
> them.
> Packed object data is simply copied into a new pack in that case which
> is less of an issue on memory usage, irrespective of the original pack
> size.
Agreed, but still, at least for very large objects. If I have a 600MB
file in my repository, it should just not get in the way. If it gets
copied around during each repack, that just wastes I/O time for no
good reason. Even worse, it causes incremental backups or filesystem
checkpoints to become way more expensive. Just leaving large files
alone as immutable objects on disk avoids all these issues.
-Geert
* Re: pack operation is thrashing my server
From: Johan Herland @ 2008-08-13 16:10 UTC (permalink / raw)
To: git
Cc: Shawn O. Pearce, Jakub Narebski, Nicolas Pitre, Geert Bosch,
Andi Kleen, Ken Pratt
On Wednesday 13 August 2008, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
> > Nicolas Pitre <nico@cam.org> writes:
> > > On Tue, 12 Aug 2008, Geert Bosch wrote:
> > > > One nice optimization we could do for those pesky binary large
> > > > objects (like PDF, JPG and GZIP-ed data), is to detect such
> > > > files and revert to compression level 0. This should be
> > > > especially beneficial since already compressed data takes most
> > > > time to compress again.
> > >
> > > That would be a good thing indeed.
> >
> > Perhaps take a sample of some given size and calculate entropy in
> > it? Or just simply add gitattribute for per file compression
> > ratio...
>
> Estimating the entropy would make it "just magic". Most of Git is
> "just magic" so that's a good direction to take. I'm not familiar
> enough with the PDF/JPG/GZIP/ZIP stream formats to know what the
> first 4-8k looks like to know if it would give a good indication
> of being already compressed.
>
> Though I'd imagine looking at the first 4k should be sufficient
> for any compressed file. Having a header composed of 4k of _text_
> before binary compressed data would be nuts. Or a git-bundle with
> a large refs listing. ;-)
As for how to estimate entropy, isn't that just a matter of feeding it
through zlib and comparing the output size to the input size? Especially
if we're already about to feed it through zlib anyway... In other
words, feed (an initial part of) the data through zlib, and if the
compression ratio so far looks good, keep going and write out the
compressed object, otherwise abort zlib and write out the original
object with compression level 0.
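A shell approximation of that idea, using gzip as a stand-in for zlib
(the 4KB sample size and the notion of a "good" ratio are arbitrary
choices, and the input filename is hypothetical):
  f=some-large-file.pdf
  in=$(head -c 4096 "$f" | wc -c)
  out=$(head -c 4096 "$f" | gzip -c | wc -c)
  # a ratio close to (or above) 1.0 suggests already-compressed data
  echo "$f: $out/$in bytes after/before gzip on a 4KB sample"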
> Hence, "just magic" is probably the better route.
Agreed.
Have fun!
...Johan
--
Johan Herland, <johan@herland.net>
www.herland.net
* Re: pack operation is thrashing my server
From: Nicolas Pitre @ 2008-08-13 17:04 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Geert Bosch, Andi Kleen, Ken Pratt, git
On Wed, 13 Aug 2008, Shawn O. Pearce wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > Well, we are talking about 50MB which is not that bad.
>
> I think we're closer to 100MB here due to the extra overheads
> I just alluded to above, and which weren't in your 104 byte
> per object figure.
Sure. That should still be workable on a machine with 256MB of RAM.
> > However there is a point where we should be realistic and just admit
> > that you need a sufficiently big machine if you have huge repositories
> > to deal with. Git should be fine serving pull requests with relatively
> > little memory usage, but anything else such as the initial repack simply
> > require enough RAM to be effective.
>
> Yea. But it would also be nice to be able to just concat packs
> together. Especially if the repository in question is an open source
> one and everything published is already known to be in the wild,
> as say it is also available over dumb HTTP. Yea, I know people
> like the 'security feature' of the packer not including objects
> which aren't reachable.
It is not only that, even if it is a point I consider important. If you
end up with 10 packs, it is likely that a base object in each of those
packs could simply be a delta against a single common base object, and
therefore the amount of data to transfer might be up to 10 times higher
than necessary.
> But how many times has Linus published something to his linux-2.6
> tree that he didn't mean to publish and had to rewind? I think
> that may be "never". Yet how many times per day does his tree get
> cloned from scratch?
That's not a good argument. Linus is a very disciplined git user,
probably more than average. We should not use that example to paper
over technical issues.
> This is also true for many internal corporate repositories.
> Users probably have full read access to the object database anyway,
> and maybe even have direct write access to it. Doing the object
> enumeration there is pointless as a security measure.
It is good for network bandwidth efficiency as I mentioned.
> I'm too busy to write a pack concat implementation proposal, so
> I'll just shutup now. But it wouldn't be hard if someone wanted
> to improve at least the initial clone serving case.
A much better solution would consist of finding just _why_ object
enumeration is so slow. This is indeed my biggest gripe with git
performance at the moment.
|nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
|
|real 0m21.742s
|user 0m21.379s
|sys 0m0.360s
That's way too long for 1030198 objects (roughly 48k objects/sec). And
it gets even worse with the gcc repository:
|nico@xanadu:gcc> time git rev-list --objects --all > /dev/null
|
|real 1m51.591s
|user 1m50.757s
|sys 0m0.810s
That's for 1267993 objects, or about 11400 objects/sec.
Clearly something is not scaling here.
Nicolas
* Re: pack operation is thrashing my server
From: Dana How @ 2008-08-13 17:13 UTC (permalink / raw)
To: Geert Bosch
Cc: Nicolas Pitre, Andi Kleen, Ken Pratt, Shawn O. Pearce, git,
danahow
Hi Geert,
I wrote the blob-size-threshold patch last year to which
Jakub Narebski referred.
I think there will eventually be a way to better handle large
objects in Git. Some possible elements:
* Loose objects have a format which can be streamed
directly into or out of packs. This avoids a round-trip through zlib,
which is a big deal for big objects. This was the effect of the "new"
loose object format to which Shawn referred. This was
removed apparently because it was ugly and/or difficult
to maintain, which I didn't understand since I didn't personally
suffer.
* Loose objects actually _are_ singleton packs, but saved
in .git/objects/xx. Workable, but would never happen due to
the extra pack header at the beginning it would add. This
takes advantage of the existing pack-to-pack streaming.
* Large loose objects are never deltified and/or never packed.
The latter was the focus of my patch.
* Large loose objects are placed in their own packs in .git/objects/pack.
Doesn't work for me since I have too many large objects,
thus slowing down _all_ pack operations.
All this is complicated by the dual nature of packfiles --
they are used as a "wire format" for serial transmission,
as well as a database format for random access.
The "magic" entropy detection idea is cute, but probably not
needed -- using the blob size should be sufficient. Trying to
(re)compress an incompressible _smallish_ blob is probably
not worth trying to avoid, and any computation on sufficiently large
blobs should be avoided.
Hopefully I can return to this problem after New Year's. And
perhaps with the expanding Git userbase, more people will have
"large blob" problems ;-) and there will be more interest in
better addressing this usage pattern.
At the moment, I am thinking about how to better structure
git's handling of very large repositories in a team entirely
connected by high-speed LAN. It seems a method where
each user has a repository with deep history, but shallow
blobs, would be ideal, but that's also very different from
how git does things now.
Have fun,
Dana How
On Wed, Aug 13, 2008 at 9:01 AM, Geert Bosch <bosch@adacore.com> wrote:
> On Aug 13, 2008, at 10:35, Nicolas Pitre wrote:
>>
>> On Tue, 12 Aug 2008, Geert Bosch wrote:
>>
>>> I've always felt that keeping largish objects (say anything >1MB)
>>> loose makes perfect sense. These objects are accessed infrequently,
>>> often binary or otherwise poor candidates for the delta algorithm.
>>
>> Or, as I suggested in the past, they can be grouped into a separate
>> pack, or even occupy a pack of their own.
>
> This is fine, as long as we're not trying to create deltas
> of the large objects, or do other things that require keeping
> the inflated data in memory.
>
>> As soon as you have more than
>> one revision of such largish objects then you lose again by keeping them
>> loose.
>
> Yes, you lose potentially in terms of disk space, but you avoid the
> large memory footprint during pack generation. For very large blobs,
> it is best to degenerate to having each revision of each file on
> its own (whether we call it a single-file pack, loose object or whatever).
> That way, the large file can stay immutable on disk, and will only
> need to be accessed during checkout. GIT will then scale with good
> performance until we run out of disk space.
>
> The alternative is that people need to keep large binary data out
> of their SCMs and handle it on the side. Consider a large web site
> where I have all scripts, HTML content, as well as a few movies
> to manage. The movies basically should be copied and stored, only
> to be accessed when a checkout (or push) is requested.
>
> If we mix the very large movies with the 100,000 objects representing
> the webpages, the resulting pack will become unwieldy and slow even
> to just copy around during repacks.
>
>> You'll have memory usage issues whenever such objects are accessed,
>> loose or not.
>
> Why? The only time we'd need to access their contents is for checkout
> or when pushing across the network. These should all be streaming
> operations with a small memory footprint.
>
>> However, once those big objects are packed once, they can
>> be repacked (or streamed over the net) without really "accessing" them.
>> Packed object data is simply copied into a new pack in that case which
>> is less of an issue on memory usage, irrespective of the original pack
>> size.
>
> Agreed, but still, at least for very large objects. If I have a 600MB
> file in my repository, it should just not get in the way. If it gets
> copied around during each repack, that just wastes I/O time for no
> good reason. Even worse, it causes incremental backups or filesystem
> checkpoints to become way more expensive. Just leaving large files
> alone as immutable objects on disk avoids all these issues.
>
> -Geert
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 17:04 ` Nicolas Pitre
@ 2008-08-13 17:19 ` Shawn O. Pearce
2008-08-14 6:33 ` Andreas Ericsson
2008-08-14 17:21 ` Linus Torvalds
2 siblings, 0 replies; 80+ messages in thread
From: Shawn O. Pearce @ 2008-08-13 17:19 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Geert Bosch, Andi Kleen, Ken Pratt, git
Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 13 Aug 2008, Shawn O. Pearce wrote:
> > Doing the object
> > enumeration is pointless as a security measure.
>
> It is good for network bandwidth efficiency as I mentioned.
The network bandwidth efficiency is the most valid argument for
the enumeration.
> > I'm too busy to write a pack concat implementation proposal
>
> A much better solution would consist of finding just _why_ object
> enumeration is so slow. This is indeed my biggest gripe with git
> performance at the moment.
...
> |nico@xanadu:gcc> time git rev-list --objects --all > /dev/null
> |
> |real 1m51.591s
> |user 1m50.757s
> |sys 0m0.810s
>
> That's for 1267993 objects, or about 11400 objects/sec.
>
> Clearly something is not scaling here.
Yikes. Last time I was looking at this sort of thing I think we
spent around 60% of our time dealing with inflating, patching and
parsing commit and tree objects. pack v4's formatting spawned
out of that particular point, but we never really finished that.
It's been years, so I can't trust my memory enough to say pack v4 is
the solution to this, without redoing the profiling. But I think
that is what one would find.
Though the decreasing objects/sec rate with increased total number
of objects suggests the object hash isn't scaling.
--
Shawn.
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 16:01 ` Geert Bosch
2008-08-13 17:13 ` Dana How
@ 2008-08-13 17:26 ` Nicolas Pitre
1 sibling, 0 replies; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-13 17:26 UTC (permalink / raw)
To: Geert Bosch; +Cc: Andi Kleen, Ken Pratt, Shawn O. Pearce, git
On Wed, 13 Aug 2008, Geert Bosch wrote:
> On Aug 13, 2008, at 10:35, Nicolas Pitre wrote:
> > On Tue, 12 Aug 2008, Geert Bosch wrote:
> >
> > > I've always felt that keeping largish objects (say anything >1MB)
> > > loose makes perfect sense. These objects are accessed infrequently,
> > > often binary or otherwise poor candidates for the delta algorithm.
> >
> > Or, as I suggested in the past, they can be grouped into a separate
> > pack, or even occupy a pack of their own.
>
> This is fine, as long as we're not trying to create deltas
> of the large objects, or do other things that require keeping
> the inflated data in memory.
First, there is the delta attribute:
|commit a74db82e15cd8a2c53a4a83e9a36dc7bf7a4c750
|Author: Junio C Hamano <junkio@cox.net>
|Date: Sat May 19 00:39:31 2007 -0700
|
| Teach "delta" attribute to pack-objects.
|
| This teaches pack-objects to use .gitattributes mechanism so
| that the user can specify certain blobs are not worth spending
| CPU cycles to attempt deltification.
|
| The name of the attrbute is "delta", and when it is set to
| false, like this:
|
| == .gitattributes ==
| *.jpg -delta
|
| they are always stored in the plain-compressed base object
| representation.
This could probably be extended to take a size limit argument as well.
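For illustration, a minimal sketch of using that attribute today; the
patterns and the file name below are only examples:
$ echo '*.mp4 -delta' >> .gitattributes
$ echo '*.iso -delta' >> .gitattributes
$ git check-attr delta -- disk.iso   # reports "disk.iso: delta: unset"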
> > As soon as you have more than
> > one revision of such largish objects then you lose again by keeping them
> > loose.
>
> Yes, you lose potentially in terms of disk space, but you avoid the
> large memory footprint during pack generation. For very large blobs,
> it is best to degenerate to having each revision of each file on
> its own (whether we call it a single-file pack, loose object or whatever).
> That way, the large file can stay immutable on disk, and will only
> need to be accessed during checkout. GIT will then scale with good
> performance until we run out of disk space.
Loose objects, though, will always be selected for potential delta
generation. Packed objects, deltified or not, are always streamed as is
when serving pull requests. And by default delta compression is not
(re)attempted between objects which are part of the same pack, the
reason being that if they were not deltified on the first packing
attempt then there is no point trying again when streaming them over the
net. So you always benefit from having your large objects packed with
the rest. This, plus the delta prevention mechanism above should cover
most cases.
> > You'll have memory usage issues whenever such objects are accessed,
> > loose or not.
> Why? The only time we'd need to access their contents is for checkout
> or when pushing across the network. These should all be streaming
> operations with a small memory footprint.
Pushing across the network, or repacking without -f, is streamed.
Checking out currently isn't (although it probably could be). Repacking
with -f definitely isn't and probably shouldn't because of complexity
issues.
> > However, once those big objects are packed once, they can
> > be repacked (or streamed over the net) without really "accessing" them.
> > Packed object data is simply copied into a new pack in that case which
> > is less of an issue on memory usage, irrespective of the original pack
> > size.
> Agreed, but still, at least for very large objects. If I have a 600MB
> file in my repository, it should just not get in the way. If it gets
> copied around during each repack, that just wastes I/O time for no
> good reason. Even worse, it causes incremental backups or filesystem
> checkpoints to become way more expensive. Just leaving large files
> alone as immutable objects on disk avoids all these issues.
Pack them in a pack of their own and stick a .keep file along with it.
At that point they will never be rewritten.
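A rough sketch of doing that by hand, assuming big-objects.txt already
lists the SHA-1s of the large blobs (names and paths are placeholders):
$ pack=$(git pack-objects --window=0 \
      .git/objects/pack/pack-big < big-objects.txt)
$ touch .git/objects/pack/pack-big-$pack.keep   # mark the pack as kept
$ git repack -a -d    # later repacks now leave those objects alone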
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 16:10 ` Johan Herland
@ 2008-08-13 17:38 ` Ken Pratt
2008-08-13 17:57 ` Nicolas Pitre
0 siblings, 1 reply; 80+ messages in thread
From: Ken Pratt @ 2008-08-13 17:38 UTC (permalink / raw)
To: Johan Herland
Cc: git, Shawn O. Pearce, Jakub Narebski, Nicolas Pitre, Geert Bosch,
Andi Kleen
> As for how to estimate entropy, isn't that just a matter of feeding it
> through zlib and comparing the output size to the input size? Especially
> if we're already about to feed it through zlib anyway... In other
> words, feed (an initial part of) the data through zlib, and if the
> compression ratio so far looks good, keep going and write out the
> compressed object, otherwise abort zlib and write out the original
> object with compression level 0.
This is probably off topic now, but as the OP, I'd like to mention
that I tried setting pack.compression = 0 and it did not solve my
memory issues. So it seems to be the packing itself that is
sucking up all the memory -- not the compression.
Thanks for all the insightful replies!
-Ken
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 17:38 ` Ken Pratt
@ 2008-08-13 17:57 ` Nicolas Pitre
0 siblings, 0 replies; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-13 17:57 UTC (permalink / raw)
To: Ken Pratt
Cc: Johan Herland, git, Shawn O. Pearce, Jakub Narebski, Geert Bosch,
Andi Kleen
On Wed, 13 Aug 2008, Ken Pratt wrote:
> > As for how to estimate entropy, isn't that just a matter of feeding it
> > through zlib and comparing the output size to the input size? Especially
> > if we're already about to feed it through zlib anyway... In other
> > words, feed (an initial part of) the data through zlib, and if the
> > compression ratio so far looks good, keep going and write out the
> > compressed object, otherwise abort zlib and write out the original
> > object with compression level 0.
>
> This is probably off topic now, but as the OP, I'd like to mention
> that I tried setting pack.compression = 0 and it did not solve my
> memory issues.
Yeah, the compression level is a tangential issue which has to do with
speed.
> So it seems to be the packing itself that is
> sucking up all the memory -- not the compression.
Initial packing requires a fair amount of memory. And if your repository is not
packed, then every clone request will act just like a first packing. So
for git on a server to behave well, repositories have to be well packed.
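For instance, something along these lines run periodically on the
server keeps clone requests cheap (the path is only an example):
$ cd /srv/git/project.git
$ git repack -a -d    # or simply: git gc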
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 15:26 ` David Tweed
@ 2008-08-13 23:54 ` Martin Langhoff
2008-08-14 9:04 ` David Tweed
0 siblings, 1 reply; 80+ messages in thread
From: Martin Langhoff @ 2008-08-13 23:54 UTC (permalink / raw)
To: David Tweed
Cc: Shawn O. Pearce, Jakub Narebski, Nicolas Pitre, Geert Bosch,
Andi Kleen, Ken Pratt, git
On Thu, Aug 14, 2008 at 3:26 AM, David Tweed <david.tweed@gmail.com> wrote:
> FWIW, PDF format is a mix of sections of uncompressed higher level
> ASCII notation and sections of compressed actual glyph/location data
The PDF spec allows compression of the "text" sections - if a PDF is
uncompressed, it's a good candidate for delta & compression.
Unfortunately, within the same file you might have an embedded JPEG.
cheers,
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 17:04 ` Nicolas Pitre
2008-08-13 17:19 ` Shawn O. Pearce
@ 2008-08-14 6:33 ` Andreas Ericsson
2008-08-14 10:04 ` Thomas Rast
2008-08-14 14:01 ` Nicolas Pitre
2008-08-14 17:21 ` Linus Torvalds
2 siblings, 2 replies; 80+ messages in thread
From: Andreas Ericsson @ 2008-08-14 6:33 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
Nicolas Pitre wrote:
> On Wed, 13 Aug 2008, Shawn O. Pearce wrote:
>
>> Nicolas Pitre <nico@cam.org> wrote:
>>> Well, we are talking about 50MB which is not that bad.
>> I think we're closer to 100MB here due to the extra overheads
>> I just alluded to above, and which weren't in your 104 byte
>> per object figure.
>
> Sure. That should still be workable on a machine with 256MB of RAM.
>
>>> However there is a point where we should be realistic and just admit
>>> that you need a sufficiently big machine if you have huge repositories
>>> to deal with. Git should be fine serving pull requests with relatively
>>> little memory usage, but anything else such as the initial repack simply
>>> requires enough RAM to be effective.
>> Yea. But it would also be nice to be able to just concat packs
>> together. Especially if the repository in question is an open source
>> one and everything published is already known to be in the wild,
>> as say it is also available over dumb HTTP. Yea, I know people
>> like the 'security feature' of the packer not including objects
>> which aren't reachable.
>
> It is not only that, even if it is a point I consider important. If you
> end up with 10 packs, it is likely that a base object in each of those
> packs could simply be a delta against a single common base object, and
> therefore the amount of data to transfer might be up to 10 times higher
> than necessary.
>
[cut]
>> This is also true for many internal corporate repositories.
>> Users probably have full read access to the object database anyway,
>> and maybe even have direct write access to it. Doing the object
>> enumeration there is pointless as a security measure.
>
> It is good for network bandwidth efficiency as I mentioned.
>
As a corporate git user, I can say that I'm very rarely worried
about how much data gets sent over our in-office gigabit network.
My primary concern wrt server side git is cpu- and IO-heavy
operations, as we run the entire machine in a vmware guest os
which just plain sucks at such things.
With that in mind, a config variable in /etc/gitconfig would
work wonderfully for that situation, as our central watering
hole only ever serves locally.
>> I'm too busy to write a pack concat implementation proposal, so
>> I'll just shut up now. But it wouldn't be hard if someone wanted
>> to improve at least the initial clone serving case.
>
> A much better solution would consist of finding just _why_ object
> enumeration is so slow. This is indeed my biggest gripe with git
> performance at the moment.
>
> |nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
> |
> |real 0m21.742s
> |user 0m21.379s
> |sys 0m0.360s
>
> That's way too long for 1030198 objects (roughly 48k objects/sec). And
> it gets even worse with the gcc repository:
>
> |nico@xanadu:gcc> time git rev-list --objects --all > /dev/null
> |
> |real 1m51.591s
> |user 1m50.757s
> |sys 0m0.810s
>
> That's for 1267993 objects, or about 11400 objects/sec.
>
> Clearly something is not scaling here.
>
What are the different packing options for the two repositories?
A longer deltachain and larger packwindow would increase the
enumeration time, wouldn't it?
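For reference, these are the knobs in question; the values below are
arbitrary examples, not recommendations:
$ git repack -a -d -f --depth=50 --window=100
$ git config pack.depth 50      # or set them persistently
$ git config pack.window 100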
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 23:54 ` Martin Langhoff
@ 2008-08-14 9:04 ` David Tweed
0 siblings, 0 replies; 80+ messages in thread
From: David Tweed @ 2008-08-14 9:04 UTC (permalink / raw)
To: Martin Langhoff
Cc: Shawn O. Pearce, Jakub Narebski, Nicolas Pitre, Geert Bosch,
Andi Kleen, Ken Pratt, git
On Thu, Aug 14, 2008 at 12:54 AM, Martin Langhoff
<martin.langhoff@gmail.com> wrote:
> On Thu, Aug 14, 2008 at 3:26 AM, David Tweed <david.tweed@gmail.com> wrote:
>> FWIW, PDF format is a mix of sections of uncompressed higher level
>> ASCII notation and sections of compressed actual glyph/location data
>
> The PDF spec allows compression of the "text" sections - if a PDF is
> uncompressed, it's a good candidate for delta & compression.
> Unfortunately, within the same file you might have an embedded JPEG.
Sure, all I was pointing out was that even pdfs with compressed page
contents can look like uncompressed text if you only look at the
entropy of the first 4k or 8k.
--
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 6:33 ` Andreas Ericsson
@ 2008-08-14 10:04 ` Thomas Rast
2008-08-14 10:15 ` Andreas Ericsson
2008-08-14 14:01 ` Nicolas Pitre
1 sibling, 1 reply; 80+ messages in thread
From: Thomas Rast @ 2008-08-14 10:04 UTC (permalink / raw)
To: Andreas Ericsson
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
[-- Attachment #1: Type: text/plain, Size: 2161 bytes --]
Andreas Ericsson wrote:
> Nicolas Pitre wrote:
> > |nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
> > |
> > |real 0m21.742s
> > |user 0m21.379s
> > |sys 0m0.360s
> >
> > That's way too long for 1030198 objects (roughly 48k objects/sec). And
> > it gets even worse with the gcc repository:
> >
> > |nico@xanadu:gcc> time git rev-list --objects --all > /dev/null
> > |
> > |real 1m51.591s
> > |user 1m50.757s
> > |sys 0m0.810s
> >
> > That's for 1267993 objects, or about 11400 objects/sec.
> >
> > Clearly something is not scaling here.
> >
>
> What are the different packing options for the two repositories?
> A longer deltachain and larger packwindow would increase the
> enumeration time, wouldn't it?
For the fun of it, I ran a test without deltas. Here's my normal
git.git:
$ du -h .git/objects/pack
26M .git/objects/pack
$ git rev-list --all | wc -l
17638
$ git rev-list --all --objects | wc -l
82194
On a hot cache I get about 61800 objects/sec:
$ /usr/bin/time git rev-list --all --objects >/dev/null
1.33user 0.04system 0:01.39elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8087minor)pagefaults 0swaps
I then made a copy of that and repacked it without deltas (remember to
remove *.keep, I tripped over that twice):
$ git repack --depth=0 --window=0 -a -f -d
Counting objects: 82906, done.
Writing objects: 100% (82906/82906), done.
Total 82906 (delta 0), reused 0 (delta 0)
$ du -h .git/objects/pack
339M .git/objects/pack
Which results in only 28739 objects/sec:
$ /usr/bin/time git rev-list --all --objects >/dev/null
2.86user 0.11system 0:02.98elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+50162minor)pagefaults 0swaps
So maybe the GCC repository would need to be packed _better_?
Unfortunately I cannot sensibly run the same test on linux-2.6.git,
which is the next bigger git I have around: it inflates to about 3GB
after the repack, which does not fit into memory.
- Thomas
--
Thomas Rast
trast@student.ethz.ch
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 10:04 ` Thomas Rast
@ 2008-08-14 10:15 ` Andreas Ericsson
2008-08-14 22:33 ` Shawn O. Pearce
0 siblings, 1 reply; 80+ messages in thread
From: Andreas Ericsson @ 2008-08-14 10:15 UTC (permalink / raw)
To: Thomas Rast
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
Thomas Rast wrote:
> Andreas Ericsson wrote:
>> Nicolas Pitre wrote:
>>> |nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
>>> |
>>> |real 0m21.742s
>>> |user 0m21.379s
>>> |sys 0m0.360s
>>>
>>> That's way too long for 1030198 objects (roughly 48k objects/sec). And
>>> it gets even worse with the gcc repository:
>>>
>>> |nico@xanadu:gcc> time git rev-list --objects --all > /dev/null
>>> |
>>> |real 1m51.591s
>>> |user 1m50.757s
>>> |sys 0m0.810s
>>>
>>> That's for 1267993 objects, or about 11400 objects/sec.
>>>
>>> Clearly something is not scaling here.
>>>
>> What are the different packing options for the two repositories?
>> A longer deltachain and larger packwindow would increase the
>> enumeration time, wouldn't it?
>
> For the fun of it, I ran a test without deltas. Here's my normal
> git.git:
>
> $ du -h .git/objects/pack
> 26M .git/objects/pack
> $ git rev-list --all | wc -l
> 17638
> $ git rev-list --all --objects | wc -l
> 82194
>
> On a hot cache I get about 61800 objects/sec:
>
> $ /usr/bin/time git rev-list --all --objects >/dev/null
> 1.33user 0.04system 0:01.39elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+8087minor)pagefaults 0swaps
>
> I then made a copy of that and repacked it without deltas (remember to
> remove *.keep, I tripped over that twice):
>
> $ git repack --depth=0 --window=0 -a -f -d
> Counting objects: 82906, done.
> Writing objects: 100% (82906/82906), done.
> Total 82906 (delta 0), reused 0 (delta 0)
> $ du -h .git/objects/pack
> 339M .git/objects/pack
>
> Which results in only 28739 objects/sec:
>
Well, if the objects are, on average, >twice the size, would that
explain it? I'd hate to see some of the sharper git minds hop off
on a wild goose chase if it's not necessary.
How does one go about getting the object sizes? rev-list appears
to have no option for it.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 6:33 ` Andreas Ericsson
2008-08-14 10:04 ` Thomas Rast
@ 2008-08-14 14:01 ` Nicolas Pitre
1 sibling, 0 replies; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-14 14:01 UTC (permalink / raw)
To: Andreas Ericsson; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Andreas Ericsson wrote:
> As a corporate git user, I can say that I'm very rarely worried
> about how much data gets sent over our in-office gigabit network.
> My primary concern wrt server side git is cpu- and IO-heavy
> operations, as we run the entire machine in a vmware guest os
> which just plain sucks at such things.
In the general case, the amount of data sent over the network is
directly proportional to disk IO.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-13 17:04 ` Nicolas Pitre
2008-08-13 17:19 ` Shawn O. Pearce
2008-08-14 6:33 ` Andreas Ericsson
@ 2008-08-14 17:21 ` Linus Torvalds
2008-08-14 17:58 ` Linus Torvalds
2008-08-14 18:38 ` Nicolas Pitre
2 siblings, 2 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-14 17:21 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Wed, 13 Aug 2008, Nicolas Pitre wrote:
>
> A much better solution would consist of finding just _why_ object
> enumeration is so slow. This is indeed my biggest gripe with git
> performance at the moment.
>
> |nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
> |
> |real 0m21.742s
> |user 0m21.379s
> |sys 0m0.360s
>
> That's way too long for 1030198 objects (roughly 48k objects/sec).
Why do you think that's horribly slow?
Doing a rev-list of all objects is a fairly rare operation, but even if
you want to clone/repack all of your archives the whole time, please
realize that listing objects is _not_ a simple operation. It opens up and
parses every single tree in the whole history. That's a _lot_ of data to
unpack.
And trees also pack very efficiently (because they delta so well), so
there's a lot of complex ops there.
> And it gets even worse with the gcc repository:
I bet it's because gcc has a different directory structure. I don't have
the gcc sources in front of me, but I'd suspect something like a single
large directory or some such.
> Clearly something is not scaling here.
I don't agree. There's no "clearly" about it. Different data sets.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 17:21 ` Linus Torvalds
@ 2008-08-14 17:58 ` Linus Torvalds
2008-08-14 19:04 ` Nicolas Pitre
2008-08-14 18:38 ` Nicolas Pitre
1 sibling, 1 reply; 80+ messages in thread
From: Linus Torvalds @ 2008-08-14 17:58 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Linus Torvalds wrote:
>
> Doing a rev-list of all objects is a fairly rare operation, but even if
> you want to clone/repack all of your archives the whole time, please
> realize that listing objects is _not_ a simple operation. It opens up and
> parses every single tree in the whole history. That's a _lot_ of data to
> unpack.
Btw, it's not that hard to run oprofile (link git statically to get better
numbers). For me, the answer to what is going on for a kernel rev-list is
pretty straightforward:
263742 26.6009 lookup_object
135945 13.7113 inflate
110525 11.1475 inflate_fast
75124 7.5770 inflate_table
64676 6.5232 strlen
48635 4.9053 memcpy
47744 4.8154 find_pack_entry_one
35265 3.5568 _int_malloc
31579 3.1850 decode_tree_entry
28388 2.8632 adler32
19441 1.9608 process_tree
10398 1.0487 patch_delta
8925 0.9002 _int_free
..
so most of it is in inflate, but I suspect the cost of "lookup_object()"
is so high because when we parse the trees we also have to look up every
blob - even if they didn't change - just to see whether we already saw it
or not.
For me, an instruction-level profile of lookup_object() shows that the
cost is all in the hashcmp (53% of the profile is on that "repz cmpsb")
and in the loading of the object pointer (26% of the profile is on the
test instruction after the "obj_hash[i]" load). I don't think we can
really improve that code much - the hash table is very efficient, and the
cost is just in the fact that we have a lot of memory accesses.
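A rough sketch of reproducing that kind of profile with oprofile's
classic opcontrol interface (exact flags may vary with the oprofile
version; ./git is the statically linked binary mentioned above):
$ opcontrol --init && opcontrol --reset && opcontrol --start
$ ./git rev-list --objects --all > /dev/null
$ opcontrol --shutdown
$ opreport --symbols ./git | head -20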
We could try to use the (more memory-hungry) "hash.c" implementation for
object hashing, which actually includes a 32-bit key inside the hash
table, but while that will avoid the cost of fetching the object pointer
for the cases where we have collisions, most of the time the cost is not
in the collision, but in the fact that we _hit_.
I bet the hit percentage is 90+%, and the cost really is just that we
encounter the same object hundreds or thousands of times.
Please realize that even if there may be "only" a million objects in the
kernel, there are *MANY* more ways to _reach_ those objects, and that is
what git-rev-list --objects does! It's not O(number-of-objects), it's
O(number-of-object-linkages).
For my current kernel archive, for example, the number of objects is
roughly 900k. However, think about how many times we'll actually reach a
blob: that's roughly (blobs per commit)*(number of commits), which can be
approximated with
echo $(( $(git ls-files | wc -l) * $(git rev-list --all | wc -l) ))
which is 24324*108518=2639591832 ie about 2.5 _billion_ times.
Now, we don't actually do anything close to that many lookups, because
when a subdirectory doesn't change at all, we'll skip the whole tree after
having seen it just once, so that will cut down on the number of objects
we have to look up by probably a couple of orders of magnitude.
But this is why the "one large directory" load performs worse: in the
worst case, if you really have a totally flat directory tree, you'd
literally see that 2.5 billion object lookup case.
So it's not that git scales badly. It's that "git rev-list --objects" is
really a very expensive operation, and while some good practices (deep
directory structures) makes it able to optimize the load away a lot, it's
still potentially very tough.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 17:21 ` Linus Torvalds
2008-08-14 17:58 ` Linus Torvalds
@ 2008-08-14 18:38 ` Nicolas Pitre
2008-08-14 18:55 ` Linus Torvalds
1 sibling, 1 reply; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-14 18:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Linus Torvalds wrote:
>
>
> On Wed, 13 Aug 2008, Nicolas Pitre wrote:
> >
> > A much better solution would consist of finding just _why_ object
> > enumeration is so slow. This is indeed my biggest gripe with git
> > performance at the moment.
> >
> > |nico@xanadu:linux-2.6> time git rev-list --objects --all > /dev/null
> > |
> > |real 0m21.742s
> > |user 0m21.379s
> > |sys 0m0.360s
> >
> > That's way too long for 1030198 objects (roughly 48k objects/sec).
>
> Why do you think that's horribly slow?
Call it a gut feeling. Or 60% CPU wasted in zlib.
> Doing a rev-list of all objects is a fairly rare operation, but even if
> you want to clone/repack all of your archives the whole time, please
> realize that listing objects is _not_ a simple operation. It opens up and
> parses every single tree in the whole history. That's a _lot_ of data to
> unpack.
I disagree. Well, right _now_ it is not a simple operation. But if you
remember, I'm one of the co-investigators of the pack v4 format, whose
goal is to make history and tree walking much, much cheaper, while making
their packed representation denser too. Even with early prototypes of
the format, with the overhead of converting objects back into the current
format on the fly in unpack_entry(), the object enumeration was _faster_
than in current git.
So this might just be what was needed to bring back some incentive
behind the pack v4 effort.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 18:38 ` Nicolas Pitre
@ 2008-08-14 18:55 ` Linus Torvalds
0 siblings, 0 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-14 18:55 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Nicolas Pitre wrote:
>
> I disagree. Well, right _now_ it is not a simple operation. But if you
> remember, I'm one of the co-investigator of the pack v4 format which
> goal is to make history and tree walking much much cheaper, while making
> their packed representation denser too.
See my other email with profile data and explanation.
Yes, zlib is high up, but it's not dominant to the point where a packfile
format change would maek a huge difference. You'd still need deltas for
trees, so even if you replaced zlib with something else, you'd still get a
large hit.
You do realize that a lot of the zlib costs are due to cache misses, not
zlib being fundamentally expensive in itself, right? Even if you made the
zlib CPU costs zero, you still couldn't avoid the _biggest_ cost.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 17:58 ` Linus Torvalds
@ 2008-08-14 19:04 ` Nicolas Pitre
2008-08-14 19:44 ` Linus Torvalds
0 siblings, 1 reply; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-14 19:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Linus Torvalds wrote:
> Btw, it's not that hard to run oprofile (link git statically to get better
> numbers). For me, the answer to what is going on for a kernel rev-list is
> pretty straightforward:
>
> 263742 26.6009 lookup_object
> 135945 13.7113 inflate
> 110525 11.1475 inflate_fast
> 75124 7.5770 inflate_table
> 64676 6.5232 strlen
> 48635 4.9053 memcpy
> 47744 4.8154 find_pack_entry_one
> 35265 3.5568 _int_malloc
> 31579 3.1850 decode_tree_entry
> 28388 2.8632 adler32
> 19441 1.9608 process_tree
> 10398 1.0487 patch_delta
> 8925 0.9002 _int_free
> ..
OK, inflate went down since last time I profiled this, but that's
probably because lookup_object went up.
> so most of it is in inflate,
Which, again, would be eliminated entirely by pack v4.
> but I suspect the cost of "lookup_object()"
> is so high becuase when we parse the trees we also have to look up every
> blob - even if they didn't change - just to see whether we already saw it
> or not.
One optimization with pack v4 was to have delta chunks aligned on tree
records, and because tree objects are no longer compressed, parsing a
tree object could be done by simply walking the delta chain directly.
Then, another optimization would consist of simply skipping any part of
a tree object making a delta reference to a base object which has
already been parsed, which would avoid a large bunch of lookup_object()
calls too. And because delta base objects are normally seen first in
recency order, this would reduce the combinatorial complexity
significantly.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 19:04 ` Nicolas Pitre
@ 2008-08-14 19:44 ` Linus Torvalds
2008-08-14 21:30 ` Andi Kleen
2008-08-14 21:50 ` Nicolas Pitre
0 siblings, 2 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-14 19:44 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Nicolas Pitre wrote:
>
> > so most of it is in inflate,
>
> Which, again, would be eliminated entirely by pack v4.
I seriously doubt that.
Nico, it's really easy to say "I wave my magic wand and nothing remains".
It's hard to actually _do_.
> One optimization with pack v4 was to have delta chunks aligned on tree
> records, and because tree objects are no longer compressed, parsing a
> tree object could be done by simply walking the delta chain directly.
Even if you do that, please take a look at the performance characteristics
of modern CPU's.
Here's a hint: the cost of a cache miss is generally about a hundred times
the cost of just about anything else.
So to make a convincing argument, you'd have to show that the actual
memory access patterns are also much better.
No, zlib isn't perfect, and nope, inflate_fast() is no "memcpy()". And
yes, I'm sure a pure memcpy would be much faster. But I seriously suspect
that a lot of the cost is literally in bringing in the source data to the
CPU. Because we just mmap() the whole pack-file, the first access to the
data is going to see the cost of the cache misses.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 19:44 ` Linus Torvalds
@ 2008-08-14 21:30 ` Andi Kleen
2008-08-15 16:15 ` Linus Torvalds
2008-08-14 21:50 ` Nicolas Pitre
1 sibling, 1 reply; 80+ messages in thread
From: Andi Kleen @ 2008-08-14 21:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
> Here's a hint: the cost of a cache miss is generally about a hundred times
100 times seems quite optimistic %)
>
> No, zlib isn't perfect, and nope, inflate_fast() is no "memcpy()". And
> yes, I'm sure a pure memcpy would be much faster. But I seriously suspect
> that a lot of the cost is literally in bringing in the source data to the
> CPU. Because we just mmap() the whole pack-file, the first access to the
> data is going to see the cost of the cache misses.
I would have thought that zlib has a sequential access pattern that the
CPU prefetchers would have an easy time with, hiding the latency.
BTW I always wonder why people reason about cache misses in oprofile
logs without actually using the cache miss counters.
-Andi
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 19:44 ` Linus Torvalds
2008-08-14 21:30 ` Andi Kleen
@ 2008-08-14 21:50 ` Nicolas Pitre
2008-08-14 23:14 ` Linus Torvalds
1 sibling, 1 reply; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-14 21:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Linus Torvalds wrote:
> Here's a hint: the cost of a cache miss is generally about a hundred times
> the cost of just about anything else.
>
> So to make a convincing argument, you'd have to show that the actual
> memory access patterns are also much better.
>
> No, zlib isn't perfect, and nope, inflate_fast() is no "memcpy()". And
> yes, I'm sure a pure memcpy would be much faster. But I seriously suspect
> that a lot of the cost is literally in bringing in the source data to the
> CPU. Because we just mmap() the whole pack-file, the first access to the
> data is going to see the cost of the cache misses.
Possible. However, the fact that both the "Compressing objects" and the
"Writing objects" phases during a repack (without -f) together are
_faster_ than the "Counting objects" phase is a sign that something is
more significant than cache misses here, especially when tree
information is a small portion of the total pack data size.
Of course we can do further profiling, say with core.compression set to
0 and a full repack, or even hacking the pack-objects code to force a
compression level of 0 for tree objects (and possibly commits too, since
pack v4 intends to deflate only the log text). Tree objects delta very
well, but they don't deflate well at all.
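The first variant of that test is simple to reproduce; a sketch, best
run on a scratch copy of the repository:
$ git config core.compression 0    # store objects uncompressed
$ git repack -a -d -f              # rewrite the pack with that setting
$ time git rev-list --all --objects > /dev/null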
OK, so I did, and the quick test for the kernel is:
|nico@xanadu:linux-2.6> time git rev-list --all --objects > /dev/null
|
|real 0m14.737s
|user 0m14.432s
|sys 0m0.296s
That's for 1031404 objects, hence we're now talking around 70k
objects/sec instead of 48k objects/sec. _Only_ by removing zlib out of
the equation despite the fact that the pack is now larger. So I bet
that additional improvements from pack v4 could improve things even
more, including the object lookup avoidance optimization I mentioned
previously.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 10:15 ` Andreas Ericsson
@ 2008-08-14 22:33 ` Shawn O. Pearce
2008-08-15 1:46 ` Nicolas Pitre
0 siblings, 1 reply; 80+ messages in thread
From: Shawn O. Pearce @ 2008-08-14 22:33 UTC (permalink / raw)
To: Andreas Ericsson
Cc: Thomas Rast, Nicolas Pitre, Geert Bosch, Andi Kleen, Ken Pratt,
git
Andreas Ericsson <ae@op5.se> wrote:
> How does one go about getting the object sizes? rev-list appears
> to have no option for it.
With great pain. You can use the output of verify-pack -v to
tell you the size of the inflated portion of the object, but for
a delta this is the inflated size of the delta, not of the fully
unpacked object.
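For example, a rough way to list the largest entries (column 3 is that
size; for deltas it is the delta size, as noted above):
$ git verify-pack -v .git/objects/pack/pack-*.idx \
      | grep -E '^[0-9a-f]{40}' | sort -k 3 -n -r | head -10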
--
Shawn.
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 21:50 ` Nicolas Pitre
@ 2008-08-14 23:14 ` Linus Torvalds
2008-08-14 23:39 ` Björn Steinbrink
2008-08-16 0:34 ` Linus Torvalds
0 siblings, 2 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-14 23:14 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Nicolas Pitre wrote:
>
> Possible. However, the fact that both the "Compressing objects" and the
> "Writing objects" phases during a repack (without -f) together are
> _faster_ than the "Counting objects" phase is a sign that something is
> more significant than cache misses here, especially when tree
> information is a small portion of the total pack data size.
Hmm. I think I may have a clue.
The size of the delta cache seems to be a sensitive parameter for this
thing. Not so much for the git archive, but working on the kernel tree,
raising it to 1024 seems to give a 20% performance improvement. That, in
turn, implies that we may be unpacking things over and over again because
of bad locality wrt delta generation.
I'm not sure how easy something like that is to fix, though. We generate
the object list in "recency" order for a reason, but that also happens to
be the worst possible order for re-using the delta cache - by the time we
get back to the next version of some tree entry, we'll have cycled through
all the other trees, and blown all the caches, so we'll end up likely
re-doing the whole delta chain.
So it's quite possible that what ends up happening is that some directory
with a deep delta chain will basically end up unpacking the whole chain -
which obviously includes inflating each delta - over and over again.
That's what the delta cache was supposed to avoid..
Looking at some call graphs, for the kernel I get:
- process_tree() called 10 million times
- causing parse_tree() called 479,466 times (whew, so 19 out of 20 trees
have already been seen and can be discarded)
- which in turn calls read_sha1_file() (total: 588,110 times, but there are
a hundred thousand+ commits)
but that actually causes
- 588,110 calls to cache_or_unpack_entry
out of which 5,850 calls hit in the cache, and 582,260 do *not*.
IOW, the delta cache effectively never triggers because the working set is
_way_ bigger than the cache, and the patterns aren't good. So since most
trees are deltas, and the max delta depth is 10, the average depth is
something like 5, and we actually get an ugly
- 1,637,999 calls to unpack_compressed_entry
which all results in a zlib inflate call.
So we actually have three times as many calls to inflate as we even have
objects parsed, due to the delta chains on the trees (the commits almost
never delta-chain at all, much less any deeper than a couple of entries).
So yeah, trees are the problem here, and yes, avoiding inflating them
would help - but mainly because we do it something like four times per
object on average!
Ouch. But we really can't just make the cache bigger, and the bad access
patterns really are on purpose here. The delta cache was not meant for
this, it was really meant for the "dig deeper into the history of a single
file" kind of situation that gets very different patterns indeed.
I'll see if I can think of anything simple to avoid all this unnecessary
work. But it doesn't look too good.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 23:14 ` Linus Torvalds
@ 2008-08-14 23:39 ` Björn Steinbrink
2008-08-15 0:06 ` Linus Torvalds
2008-08-16 0:34 ` Linus Torvalds
1 sibling, 1 reply; 80+ messages in thread
From: Björn Steinbrink @ 2008-08-14 23:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
On 2008.08.14 16:14:26 -0700, Linus Torvalds wrote:
>
> On Thu, 14 Aug 2008, Nicolas Pitre wrote:
> >
> > Possible. However, the fact that both the "Compressing objects" and the
> > "Writing objects" phases during a repack (without -f) together are
> > _faster_ than the "Counting objects" phase is a sign that something is
> > more significant than cache misses here, especially when tree
> > information is a small portion of the total pack data size.
>
> Hmm. I think I may have clue.
>
> The size of the delta cache seems to be a sensitive parameter for this
> thing. Not so much for the git archive, but working on the kernel tree,
> raising it to 1024 seems to give a 20% performance improvement. That, in
> turn, implies that we may be unpacking things over and over again because
> of bad locality wrt delta generation.
Since you mention the delta cache, uau (no idea about his real name) on
#git was talking about some delta cache optimizations lately; although
he was dealing with "git log -S", maybe it affects rev-list in a similar
way. Unfortunately, I can't seem to find any code for that, just a
description of what he did and some numbers on the results in the IRC
logs.
http://colabti.org/irclogger/irclogger_log/git?date=2008-08-04,Mon#l65
Maybe that helps in some way.
Björn
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 23:39 ` Björn Steinbrink
@ 2008-08-15 0:06 ` Linus Torvalds
2008-08-15 0:25 ` Linus Torvalds
2008-08-16 12:47 ` Björn Steinbrink
0 siblings, 2 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-15 0:06 UTC (permalink / raw)
To: Björn Steinbrink
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
On Fri, 15 Aug 2008, Björn Steinbrink wrote:
>
> Since you mention the delta cache, uau (no idea about his real name) on
> #git was talking about some delta cache optimizations lately, although
> he was dealing with "git log -S", maybe it affects rev-list in a similar
> way. Unfortunately, I can't seem to find any code for that, just a
> description of what he did and some numbers on the results in the IRC
> logs.
Yes, interesting.
The delta cache was really a huge hack that just turned out rather
successful. It's been hacked on further since (to do some half-way
reasonable replacement with _another_ hack by adding an LRU on top of it),
but it really is very hacky indeed.
The "hash" we use for looking things up is also pretty much a joke, and it
has no overflow capability, it just replaces the old entry with a new one.
I wonder how hard it would be to replace the whole table thing with our
generic hash.c hash thing. I'll take a look.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-15 0:06 ` Linus Torvalds
@ 2008-08-15 0:25 ` Linus Torvalds
2008-08-16 12:47 ` Björn Steinbrink
1 sibling, 0 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-15 0:25 UTC (permalink / raw)
To: Björn Steinbrink
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
On Thu, 14 Aug 2008, Linus Torvalds wrote:
>
> I wonder how hard it would be to replace the whole table thing with our
> generic hash.c hash thing. I'll take a look.
Ok, I did a quick version that didn't replace anything at all, and it
doesn't look like there is much room for that to help. Yes, I can speed
things up, but it didn't get much faster than just raising the delta cache
to 1024 entries.
Admittedly my quick hack might have been fundamentally flawed, but it was
such an ugly thing that I'm not even going to post it.
And the added memory footprint makes it unacceptable, so it's going to be
limited by the cache size anyway, and not get a lot of hits in git
rev-list, methinks.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 22:33 ` Shawn O. Pearce
@ 2008-08-15 1:46 ` Nicolas Pitre
0 siblings, 0 replies; 80+ messages in thread
From: Nicolas Pitre @ 2008-08-15 1:46 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Andreas Ericsson, Thomas Rast, Geert Bosch, Andi Kleen, Ken Pratt,
git
On Thu, 14 Aug 2008, Shawn O. Pearce wrote:
> Andreas Ericsson <ae@op5.se> wrote:
> > How does one go about getting the object sizes? rev-list appears
> > to have no option for it.
>
> With great pain. You can use the output of verify-pack -v to
> tell you the size of the inflated portion of the object, but for
> a delta this is the inflated size of the delta, not of the fully
> unpacked object.
Delta objects have the size of the final object in their header. There
is get_size_from_delta() extracting that information already. There is
simply no interface exporting that info to external tools but that
shouldn't be hard to add.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 21:30 ` Andi Kleen
@ 2008-08-15 16:15 ` Linus Torvalds
0 siblings, 0 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-08-15 16:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Ken Pratt, git
On Thu, 14 Aug 2008, Andi Kleen wrote:
>
> I would have thought that zlib has a sequential access pattern that the
> CPU prefetchers have a easy time with hiding latency.
No, the lookup tables for the patterns are quite non-sequential. It does
do a lot of indirect accesses, ie it loads data from the input stream and
then looks things up through that.
But it's quite possible that we should use different compression factors
for different object types. Right now we have different (configurable)
compression levels for loose objects and packs, but it might be
interesting to see what happens for just "packed tree objects".
The trees really end up having rather different access patterns in
pack-files. They also tend to be rather less compressible than other
blobs, since the SHA1's in there are just random binary data. They also
delta very well - obviously regular blobs do that _too_, but regular blobs
are seldom as performance-critical in git (ie once you actually unpack a
blob, there are other things going on like actually generating a diff -
but trees get unpacked over and over for "internal git reasons")
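The knobs that exist today are per storage type rather than per object
type; a sketch, with arbitrary example values:
$ git config core.loosecompression 3   # zlib level for loose objects
$ git config pack.compression 9        # zlib level for packed objects
$ git config core.compression 1        # default when the above are unset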
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-14 23:14 ` Linus Torvalds
2008-08-14 23:39 ` Björn Steinbrink
@ 2008-08-16 0:34 ` Linus Torvalds
2008-09-07 1:03 ` Junio C Hamano
1 sibling, 1 reply; 80+ messages in thread
From: Linus Torvalds @ 2008-08-16 0:34 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Geert Bosch, Andi Kleen, Ken Pratt, git
On Thu, 14 Aug 2008, Linus Torvalds wrote:
>
> So yeah, trees are the problem here, and yes, avoiding inflating them
> would help - but mainly because we do it something like four times per
> object on average!
Interestingly, it turns out that git also hits a sad performance downside
of using zlib.
We always tend to set "stream.avail_out" to the exact size of the expected
output. And it turns out that that means that the fast-path case of
inffast.c doesn't trigger as often as it could. This (idiotic) patch
actually seems to help performance on git rev-list by about 5%.
But maybe it's just me seeing things. I did this because of the entry
assumptions in inflate_fast(): that code only triggers for the case of
strm->avail_out >= 258.
Sad, if true.
Linus
---
sha1_file.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/sha1_file.c b/sha1_file.c
index a57155d..5ca7ce2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1500,11 +1500,11 @@ static void *unpack_compressed_entry(struct packed_git *p,
z_stream stream;
unsigned char *buffer, *in;
- buffer = xmalloc(size + 1);
+ buffer = xmalloc(size + 256 + 1);
buffer[size] = 0;
memset(&stream, 0, sizeof(stream));
stream.next_out = buffer;
- stream.avail_out = size;
+ stream.avail_out = size + 256;
inflateInit(&stream);
do {
^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-15 0:06 ` Linus Torvalds
2008-08-15 0:25 ` Linus Torvalds
@ 2008-08-16 12:47 ` Björn Steinbrink
1 sibling, 0 replies; 80+ messages in thread
From: Björn Steinbrink @ 2008-08-16 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
[-- Attachment #1: Type: text/plain, Size: 996 bytes --]
On 2008.08.14 17:06:13 -0700, Linus Torvalds wrote:
> The "hash" we use for looking things up is also pretty much a joke, and it
> has no overflow capability, it just replaces the old entry with a new one.
So I added some stupid tracing to cache_or_unpack_entry to see how often
we reread the same stuff. The whole thing just logs the base_offset in
case of a cache miss. I've gc'ed my linux-2.6.git before the run, so
that there's only a single packed_git around (at least I hope so), and I
can ignore that for the tracing.
The whole log for a "git rev-list --objects HEAD" has about 1.2M
entries, while the output of the rev-list command has about 870k lines.
Some postprocessing of the trace shows that the majority of objects are
read only once or twice. A few percent are read three to ten times, and
some are read more than two hundred times.
I'll attach the post-processed thing. The format is:
x y
Meaning that there were x base_offset values for which we had y cache
misses.
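The post-processing can be done with something along these lines (the
trace file name is hypothetical):
# count misses per base_offset, then count how many offsets share each count
$ sort miss-trace.log | uniq -c | awk '{print $1}' | sort -n | uniq -c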
Björn
[-- Attachment #2: cache-misses --]
[-- Type: text/plain, Size: 2437 bytes --]
391805 1
110622 2
27830 3
13995 4
8583 5
5834 6
4275 7
3242 8
2514 9
2168 10
1632 11
1336 12
1197 13
947 14
788 15
704 16
565 17
514 18
422 19
348 20
304 21
276 22
227 23
233 24
180 25
160 26
145 27
106 28
123 29
109 30
86 31
91 32
72 33
63 34
55 35
73 36
61 37
56 38
48 39
44 40
47 41
36 42
44 43
47 44
32 45
36 46
27 47
19 48
34 49
28 50
22 51
21 52
26 53
18 54
19 55
16 56
22 57
16 58
16 59
11 60
13 61
19 62
17 63
8 64
21 65
8 66
8 67
16 68
9 69
12 70
11 71
8 72
5 73
6 74
9 75
6 76
9 77
7 78
8 79
7 80
8 81
6 82
5 83
13 84
9 85
8 86
4 87
5 89
6 90
3 91
7 92
4 93
5 94
5 95
5 96
4 97
3 98
7 99
2 100
4 101
4 102
7 103
4 104
4 105
5 106
3 107
1 108
4 109
1 110
1 111
1 112
6 113
5 114
2 115
5 116
2 117
2 118
2 119
7 120
1 121
4 122
3 123
3 124
3 125
4 126
1 127
2 128
2 129
2 130
1 131
4 132
1 133
4 134
1 135
2 136
4 137
1 139
3 140
3 141
5 142
5 143
4 144
1 148
2 149
3 150
1 151
2 152
6 153
1 154
2 155
2 156
3 157
2 158
1 159
3 160
2 161
4 162
2 163
5 164
2 165
2 166
2 169
2 170
1 171
1 172
1 173
1 176
2 177
2 178
2 179
1 180
3 181
3 182
1 183
1 184
1 186
1 187
1 190
1 192
1 194
2 195
3 196
1 197
1 200
1 201
1 202
1 208
2 214
2 216
2 217
3 224
1 225
1 228
1 230
2 232
2 233
1 234
2 236
1 239
1 241
1 245
1 246
2 249
2 250
1 252
1 259
1 261
2 263
1 266
2 268
1 272
1 282
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-08-16 0:34 ` Linus Torvalds
@ 2008-09-07 1:03 ` Junio C Hamano
2008-09-07 1:46 ` Linus Torvalds
0 siblings, 1 reply; 80+ messages in thread
From: Junio C Hamano @ 2008-09-07 1:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Interestingly, it turns out that git also hits a sad performance downside
> of using zlib.
>
> We always tend to set "stream.avail_out" to the exact size of the expected
> output. And it turns out that that means that the fast-path case of
> inffast.c doesn't trigger as often as it could. This (idiotic) patch
> actually seems to help performance on git rev-list by about 5%.
>
> But maybe it's just me seeing things. But I did this because of the entry
> assumptions in inflate_fast(), that code only triggers for the case of
> strm->avail_out >= 258.
>
> Sad, if true.
This is reproducible: "rev-list --objects --all" in my copy of the kernel
repo takes around 47-48 seconds of user time, and with the (idiotic) patch
it is cut down to 41-42 seconds.
(with patch)
41.41user 0.51system 0:41.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+134411minor)pagefaults 0swaps
(without patch)
47.21user 0.64system 0:47.85elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+134935minor)pagefaults 0swaps
One funny thing about your patch is that it also reduces the number of
minor faults; I would have expected that the additional memory wastage
(even though most of the allocated object buffer memory would be freed
immediately as soon as the caller is done with it) would result in a
larger number of faults, not smaller, which is puzzling.
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 1:03 ` Junio C Hamano
@ 2008-09-07 1:46 ` Linus Torvalds
2008-09-07 2:33 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-09-07 1:46 UTC (permalink / raw)
To: Junio C Hamano
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
On Sat, 6 Sep 2008, Junio C Hamano wrote:
>
> This is reproducible "rev-list --objects --all" in my copy of the kernel
> repo takes around 47-48 seconds user time, and with the (idiotic) patch it
> is cut down to 41-42 seconds.
So I had forgotten about that patch since nobody reacted to it.
I think the patch is wrong, please don't apply it, even though it does
help performance.
The reason?
Right now we depend on "avail_out" also making zlib know when to stop
looking at the input stream. Sad, but true - we don't know or care about
the compressed size of the object, only the uncompressed size. So in
unpack_compressed_entry(), we simply set the output length, and expect
zlib to stop when it's sufficient.
Which it does - but the patch kind of violates that whole design.
Now, it so happens that things seem to work, probably because the zlib
format does have enough synchronization in it to not try to continue past
the end _anyway_, but I think this makes the patch be of debatable value.
I'm starting to hate zlib. I actually spent almost a week trying to clean
up the zlib source code and make it something that gcc can compile into
clean code, but the fact is, zlib isn't amenable to that. The whole "shift
<n> bits in from the buffer" approach means that there is no way to make
zlib generate good code unless you are an insanely competent assembly
hacker or have tons of registers to keep all the temporaries live in.
Now, I still do think that all my reasons for choosing zlib were pretty
solid (it's a well-tested piece of code and it is _everywhere_ and easy to
use), but boy do I wish there had been alternatives.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 1:46 ` Linus Torvalds
@ 2008-09-07 2:33 ` Junio C Hamano
2008-09-07 17:11 ` Nicolas Pitre
2008-09-07 2:50 ` Jon Smirl
2008-09-07 7:45 ` Mike Hommey
2 siblings, 1 reply; 80+ messages in thread
From: Junio C Hamano @ 2008-09-07 2:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> The reason?
>
> Right now we depend on "avail_out" also making zlib understand to stop
> looking at the input stream. Sad, but true - we don't know or care about
> the compressed size of the object, only the uncompressed size. So in
> unpack_compressed_entry(), we simply set the output length, and expect
> zlib to stop when it's sufficient.
>
> Which it does - but the patch kind of violates that whole design.
>
> Now, it so happens that things seem to work, probably because the zlib
> format does have enough synchronization in it to not try to continue past
> the end _anyway_, but I think this makes the patch be of debatable value.
I thought the fact we do check the status with Z_STREAM_END means that we
do already expect and rely on zlib to know where the end of input stream
is, and stop there (otherwise we say something fishy is going on and we
error out), and it was part of the design, not just "so happens" and "has
enough synch ... _anyway_".
If input zlib stream were corrupted and it detected the end of stream too
early, then check of "stream.total_out != size" would fail even though we
would see "st == Z_STREAM_END". If input stream were corrupted and it
went past the end marker, we will read past the end and into some garbage
that is the in-pack header of the next object representation, but zlib
shouldn't go berserk even in that case, and would stop after filling the
slop you allocated in the buffer --- we would detect the situation from
stream.total_out != size and most likely st != Z_STREAM_END in such a
case.
While I think 5% is a large enough gain, I'll leave this on the back burner
for now. I think the graver issue is that we inflate the same object many
times, as you noticed during the discussion.
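As an aside for readers following the zlib details, here is a minimal sketch
of the pattern being discussed: the caller knows the uncompressed size up
front, hands zlib that much output room, and then checks both Z_STREAM_END
and total_out. This is an illustration, not git's actual
unpack_compressed_entry(); the EXTRA_SLOP constant is a hypothetical
stand-in for the patch's idea of leaving inflate_fast() the headroom
(avail_out >= 258) it needs to stay on its fast path. Build against zlib
with -lz.

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Hypothetical: headroom beyond the expected size so inflate_fast() can
 * engage; the unpatched behaviour corresponds to EXTRA_SLOP == 0. */
#define EXTRA_SLOP 258

void *inflate_known_size(const unsigned char *in, size_t in_len,
                         unsigned long size)
{
    z_stream stream;
    unsigned char *buf = malloc(size + EXTRA_SLOP + 1);
    int st;

    if (!buf)
        return NULL;

    memset(&stream, 0, sizeof(stream));
    stream.next_in = (Bytef *)in;
    stream.avail_in = (uInt)in_len;
    stream.next_out = buf;
    stream.avail_out = (uInt)(size + EXTRA_SLOP);

    if (inflateInit(&stream) != Z_OK) {
        free(buf);
        return NULL;
    }
    st = inflate(&stream, Z_FINISH);
    inflateEnd(&stream);

    /* The two safety checks discussed above: the stream must terminate
     * cleanly AND produce exactly the expected number of bytes. */
    if (st != Z_STREAM_END || stream.total_out != size) {
        free(buf);
        return NULL;
    }
    buf[size] = '\0';
    return buf;
}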
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 1:46 ` Linus Torvalds
2008-09-07 2:33 ` Junio C Hamano
@ 2008-09-07 2:50 ` Jon Smirl
2008-09-07 3:07 ` Linus Torvalds
2008-09-07 7:45 ` Mike Hommey
2 siblings, 1 reply; 80+ messages in thread
From: Jon Smirl @ 2008-09-07 2:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
On 9/6/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I'm starting to hate zlib. I actually spent almost a week trying to clean
> up the zlib source code and make it something that gcc can compile into
> clean code, but the fact is, zlib isn't amenable to that. The whole "shift
> <n> bits in from the buffer" approach means that there is no way to make
> zlib generate good code unless you are an insanely competent assembly
> hacker or have tons of registers to keep all the temporaries live in.
>
> Now, I still do think that all my reasons for choosing zlib were pretty
> solid (it's a well-tested piece of code and it is _everywhere_ and easy to
> use), but boy do I wish there had been alternatives.
Some alternative algorithms are here...
http://cs.fit.edu/~mmahoney/compression
It is possible to beat zlib by 2x at the cost of CPU time and memory.
Of course switching to these algorithms would involve a lot of testing
and benchmarking. I'm also not sure how PAQ would fare on lots of
small git objects instead of large files.
Turning a 500MB packfile into a 250MB one has lots of advantages in IO
reduction, so it is worth some CPU/memory to create it.
You can even win 50'000€ for a better algorithm.
http://prize.hutter1.net/
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 2:50 ` Jon Smirl
@ 2008-09-07 3:07 ` Linus Torvalds
2008-09-07 3:43 ` Jon Smirl
2008-09-07 8:18 ` Andreas Ericsson
0 siblings, 2 replies; 80+ messages in thread
From: Linus Torvalds @ 2008-09-07 3:07 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On Sat, 6 Sep 2008, Jon Smirl wrote:
>
> Some alternative algorithms are here...
> http://cs.fit.edu/~mmahoney/compression
> It is possible to beat zlib by 2x at the cost of CPU time and memory.
Jon, you're missing the point.
The problem with zlib isn't that it doesn't compress well. It's that it's
too _SLOW_.
> Turning a 500MB packfile into a 250MB has lots of advantages in IO
> reduction so it is worth some CPU/memory to create it.
..and secondly, there's no way you'll find a compressor that comes even
close to being twice as good. 10% better yes - but then generally much
MUCH slower.
Take a look at that web page you quote, and then sort things by
decompression speed. THAT is the issue.
And no, LZO isn't even on that list. I haven't tested it, but looking at
the code, I do think LZO can be fast exactly because it seems to be
byte-based rather than bit-based, so I'd not be surprised if the claims
for its uncompression speed are true.
The constant bit-shifting/masking/extraction kills zlib performance (and
please realize that zlib is at the TOP of the list when looking at the
thing you pointed to - that silly site seems to not care about compressor
speed at all, _only_ about size). So "kills" is a relative measure, but
really - we're looking for _faster_ algorithms, not slower ones!
Linus
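As an aside, the comparison suggested above is easy to approximate locally.
The sketch below is a rough micro-benchmark, not anything from git: it
times repeated zlib decompression of a synthetic buffer (the 8MB size, the
20 rounds, and the input pattern are arbitrary assumptions), and a
candidate codec would simply replace the uncompress() call inside the
timed loop. Build with something like "cc -O2 bench.c -lz" (file name is
arbitrary).

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
    const uLong srclen = 8 * 1024 * 1024;    /* assumption: 8MB of input */
    Bytef *src = malloc(srclen);
    Bytef *out = malloc(srclen);
    uLongf zlen = compressBound(srclen);
    Bytef *zbuf = malloc(zlen);
    if (!src || !out || !zbuf)
        return 1;

    for (uLong i = 0; i < srclen; i++)       /* repetitive, compressible data */
        src[i] = (Bytef)((i >> 6) & 0xff);

    if (compress2(zbuf, &zlen, src, srclen, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int round = 0; round < 20; round++) {   /* timed: decompression only */
        uLongf outlen = srclen;
        if (uncompress(out, &outlen, zbuf, zlen) != Z_OK)
            return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("ratio %.2fx, decompression %.1f MB/s\n",
           (double)srclen / zlen, 20.0 * srclen / (1024.0 * 1024.0) / secs);
    return 0;
}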
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 3:07 ` Linus Torvalds
@ 2008-09-07 3:43 ` Jon Smirl
2008-09-07 4:50 ` Linus Torvalds
2008-09-07 8:18 ` Andreas Ericsson
1 sibling, 1 reply; 80+ messages in thread
From: Jon Smirl @ 2008-09-07 3:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
On 9/6/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Sat, 6 Sep 2008, Jon Smirl wrote:
> >
> > Some alternative algorithms are here...
> > http://cs.fit.edu/~mmahoney/compression
> > It is possible to beat zlib by 2x at the cost of CPU time and memory.
>
>
> Jon, you're missing the point.
>
> The problem with zlib isn't that it doesn't compress well. It's that it's
> too _SLOW_.
When I was playing with those giant Mozilla packs speed of zlib wasn't
a big problem. Number one problem was the repack process exceeding 3GB
which forced me to get 64b hardware and 8GB of memory. If you start
swapping in a repack, kill it, it will probably take a month to
finish.
I'm forgetting the numbers now, but on a quad-core machine (with git
changes to use all cores) and 8GB I believe I was able to repack the
Mozilla repo in under an hour. At that point I believe I was being
limited by disk IO.
Size and speed are not unrelated. By cutting the pack size in half
you reduce the IO and memory demands (cache misses) a lot. For example,
if we went with no compression we'd be killed by memory and IO
consumption. It's not obvious to me what the best trade-off for git is
without trying several compression algorithms and comparing. They were
feeding 100MB into PAQ on that site; I don't know what PAQ would do
with a bunch of 2K objects.
Most delta chains in the Mozilla data were easy to process. There was
a single 2000-delta chain that consumed 15% of the total CPU time to
process. Something causes performance to fall apart on really long
chains.
> > Turning a 500MB packfile into a 250MB has lots of advantages in IO
> > reduction so it is worth some CPU/memory to create it.
>
>
> ..and secondly, there's no way you'll find a compressor that comes even
> close to being twice as good. 10% better yes - but then generally much
> MUCH slower.
>
> Take a look at that web page you quote, and then sort things by
> decompression speed. THAT is the issue.
>
> And no, LZO isn't even on that list. I haven't tested it, but looking at
> the code, I do think LZO can be fast exactly because it seems to be
> byte-based rather than bit-based, so I'd not be surprised if the claims
> for its uncompression speed are true.
>
> The constant bit-shifting/masking/extraction kills zlib performance (and
> please realize that zlib is at the TOP of the list when looking at the
> thing you pointed to - that silly site seems to not care about compressor
> speed at all, _only_ about size). So "kills" is a relative measure, but
> really - we're looking for _faster_ algorithms, not slower ones!
>
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 3:43 ` Jon Smirl
@ 2008-09-07 4:50 ` Linus Torvalds
2008-09-07 13:58 ` Jon Smirl
0 siblings, 1 reply; 80+ messages in thread
From: Linus Torvalds @ 2008-09-07 4:50 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On Sat, 6 Sep 2008, Jon Smirl wrote:
>
> When I was playing with those giant Mozilla packs speed of zlib wasn't
> a big problem. Number one problem was the repack process exceeding 3GB
> which forced me to get 64b hardware and 8GB of memory. If you start
> swapping in a repack, kill it, it will probably take a month to
> finish.
.. and you'd make things much much WORSE?
> Size and speed are not unrelated.
Jon, go away.
Go and _look_ at those damn numbers you tried to point me to.
Those "better" compression models you pointed at are not only hundreds of
times slower than zlib, they take hundreds of times more memory too!
Yes, size and speed are definitely not unrelated. And in this situation,
when it comes to compression algorithms, the relationship is _very_ clear:
- better compression takes more memory and is slower
Really. You're trying to argue for something, but you don't seem to
realize that you argue _against_ the thing you think you are arguing for.
Linus
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 1:46 ` Linus Torvalds
2008-09-07 2:33 ` Junio C Hamano
2008-09-07 2:50 ` Jon Smirl
@ 2008-09-07 7:45 ` Mike Hommey
2 siblings, 0 replies; 80+ messages in thread
From: Mike Hommey @ 2008-09-07 7:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Nicolas Pitre, Shawn O. Pearce, Geert Bosch,
Andi Kleen, Ken Pratt, git
On Sat, Sep 06, 2008 at 06:46:29PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 6 Sep 2008, Junio C Hamano wrote:
> >
> > This is reproducible "rev-list --objects --all" in my copy of the kernel
> > repo takes around 47-48 seconds user time, and with the (idiotic) patch it
> > is cut down to 41-42 seconds.
>
> So I had forgotten about that patch since nobody reacted to it.
>
> I think the patch is wrong, please don't apply it, even though it does
> help performance.
>
> The reason?
>
> Right now we depend on "avail_out" also making zlib understand to stop
> looking at the input stream. Sad, but true - we don't know or care about
> the compressed size of the object, only the uncompressed size. So in
> unpack_compressed_entry(), we simply set the output length, and expect
> zlib to stop when it's sufficient.
>
> Which it does - but the patch kind of violates that whole design.
>
> Now, it so happens that things seem to work, probably because the zlib
> format does have enough synchronization in it to not try to continue past
> the end _anyway_, but I think this makes the patch be of debatable value.
>
> I'm starting to hate zlib. I actually spent almost a week trying to clean
> up the zlib source code and make it something that gcc can compile into
> clean code, but the fact is, zlib isn't amenable to that. The whole "shift
> <n> bits in from the buffer" approach means that there is no way to make
> zlib generate good code unless you are an insanely competent assembly
> hacker or have tons of registers to keep all the temporaries live in.
>
> Now, I still do think that all my reasons for choosing zlib were pretty
> solid (it's a well-tested piece of code and it is _everywhere_ and easy to
> use), but boy do I wish there had been alternatives.
I know that at least 7-zip has its own gzip compression/decompression code
(though it's C++). Maybe some other tools have theirs too.
Anyway, if it can make a speed difference, it might be worth having a
minimalist custom gzip compression/decompression "library" embedded
within git.
Mike
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 3:07 ` Linus Torvalds
2008-09-07 3:43 ` Jon Smirl
@ 2008-09-07 8:18 ` Andreas Ericsson
1 sibling, 0 replies; 80+ messages in thread
From: Andreas Ericsson @ 2008-09-07 8:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jon Smirl, git
Linus Torvalds wrote:
>
> Take a look at that web page you quote, and then sort things by
> decompression speed. THAT is the issue.
>
> And no, LZO isn't even on that list. I haven't tested it, but looking at
> the code, I do think LZO can be fast exactly because it seems to be
> byte-based rather than bit-based, so I'd not be surprised if the claims
> for its uncompression speed are true.
>
Some lzo vs zlib benchmark figures (for git) are available here:
http://www.gelato.unsw.edu.au/archives/git/0504/1700.html
LZO also ships their "minilzo.[ch]" fileset for easy inclusion in other
projects. I've used it a couple of times with decent results.
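As an aside, a minimal round-trip with minilzo looks roughly like the
sketch below. This is an editor's illustration patterned on the example
usage shipped with LZO, not code from git or from the benchmark above;
compile minilzo.c alongside it and double-check the names and the
worst-case buffer sizing against the bundled minilzo.h.

#include <stdio.h>
#include <string.h>
#include "minilzo.h"

/* Work memory for the fast LZO1X-1 compressor, aligned as LZO expects. */
static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
                          / sizeof(lzo_align_t)];

int main(void)
{
    /* Worst-case compressed size per the LZO docs: len + len/16 + 64 + 3. */
    unsigned char in[4096], comp[4096 + 4096 / 16 + 64 + 3], out[4096];
    lzo_uint comp_len, out_len;

    if (lzo_init() != LZO_E_OK)
        return 1;

    memset(in, 'x', sizeof(in));             /* trivially compressible input */

    if (lzo1x_1_compress(in, sizeof(in), comp, &comp_len, wrkmem) != LZO_E_OK)
        return 1;

    out_len = sizeof(out);                   /* safe variant checks this bound */
    if (lzo1x_decompress_safe(comp, comp_len, out, &out_len, NULL) != LZO_E_OK
        || out_len != sizeof(in) || memcmp(in, out, out_len) != 0)
        return 1;

    printf("%u -> %lu bytes and back\n",
           (unsigned)sizeof(in), (unsigned long)comp_len);
    return 0;
}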
As for testing, both have been thoroughly vetted by NASA. LZO is used for
communication with satellites and that spacestation thing they had some
time ago, while zlib is being used for sending data back from Hubble and
other large data gatherers.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 4:50 ` Linus Torvalds
@ 2008-09-07 13:58 ` Jon Smirl
2008-09-07 17:08 ` Nicolas Pitre
0 siblings, 1 reply; 80+ messages in thread
From: Jon Smirl @ 2008-09-07 13:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
On 9/7/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Sat, 6 Sep 2008, Jon Smirl wrote:
> >
>
> > When I was playing with those giant Mozilla packs speed of zlib wasn't
> > a big problem. Number one problem was the repack process exceeding 3GB
> > which forced me to get 64b hardware and 8GB of memory. If you start
> > swapping in a repack, kill it, it will probably take a month to
> > finish.
>
>
> .. and you'd make things much much WORSE?
My observations on the Mozilla packs indicated that the problems were
elsewhere in git, not in the decompression algorithms. Why does a
single 2000-delta chain take 15% of the entire pack time? Something
isn't right in how long chains are processed; it triggers far more
decompressions than needed.
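As an aside, a toy model of that amplification (an editor's illustration
only; it is not how git actually walks a pack, and real access order
differs): it just counts how many inflate steps are needed to rebuild
every object in a single chain of depth 2000, with and without a cache of
the most recently reconstructed base. The gap is roughly N*(N+1)/2 versus
N inflate calls for a chain of depth N.

#include <stdio.h>

/* Number of entries that must be inflated to rebuild the object at depth
 * `depth`, given that everything up to depth `cached_upto` is already
 * available in a cache of reconstructed bases. */
static long reconstruct_cost(int depth, int cached_upto)
{
    return depth > cached_upto ? depth - cached_upto : 0;
}

int main(void)
{
    const int chain = 2000;
    long no_cache = 0, with_cache = 0;

    for (int d = 1; d <= chain; d++) {
        no_cache += reconstruct_cost(d, 0);       /* rebuild from scratch */
        with_cache += reconstruct_cost(d, d - 1); /* previous base cached */
    }
    printf("inflate steps without base cache: %ld\n", no_cache);
    printf("inflate steps with base cache:    %ld\n", with_cache);
    return 0;
}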
>
>
> > Size and speed are not unrelated.
>
>
> Jon, go away.
>
> Go and _look_ at those damn numbers you tried to point me to.
>
> Those "better" compression models you pointed at are not only hundreds of
> times slower than zlib, they take hundreds of times more memory too!
>
> Yes, size and speed are definitely not unrelated. And in this situation,
> when it comes to compression algorithms, the relationship is _very_ clear:
>
> - better compression takes more memory and is slower
>
> Really. You're trying to argue for something, but you don't seem to
> realize that you argue _against_ the thing you think you are arguing for.
>
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 13:58 ` Jon Smirl
@ 2008-09-07 17:08 ` Nicolas Pitre
2008-09-07 20:33 ` Jon Smirl
0 siblings, 1 reply; 80+ messages in thread
From: Nicolas Pitre @ 2008-09-07 17:08 UTC (permalink / raw)
To: Jon Smirl; +Cc: Linus Torvalds, git
On Sun, 7 Sep 2008, Jon Smirl wrote:
> On 9/7/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >
> > On Sat, 6 Sep 2008, Jon Smirl wrote:
> > >
> >
> > > When I was playing with those giant Mozilla packs speed of zlib wasn't
> > > a big problem. Number one problem was the repack process exceeding 3GB
> > > which forced me to get 64b hardware and 8GB of memory. If you start
> > > swapping in a repack, kill it, it will probably take a month to
> > > finish.
> >
> >
> > .. and you'd make things much much WORSE?
>
> My observations on the Mozilla packs indicated that the problems were
> elsewhere in git, not in the decompression algorithms. Why does a
> single 2000 delta chain take 15% of the entire pack time? Something
> isn't right when long chains are processed which triggers far more
> decompressions than needed.
Please have a look at commit eac12e2d4d7f. This fix improved things for
my gcc repack tests.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 2:33 ` Junio C Hamano
@ 2008-09-07 17:11 ` Nicolas Pitre
2008-09-07 17:41 ` Junio C Hamano
0 siblings, 1 reply; 80+ messages in thread
From: Nicolas Pitre @ 2008-09-07 17:11 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
On Sat, 6 Sep 2008, Junio C Hamano wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > The reason?
> >
> > Right now we depend on "avail_out" also making zlib understand to stop
> > looking at the input stream. Sad, but true - we don't know or care about
> > the compressed size of the object, only the uncompressed size. So in
> > unpack_compressed_entry(), we simply set the output length, and expect
> > zlib to stop when it's sufficient.
> >
> > Which it does - but the patch kind of violates that whole design.
> >
> > Now, it so happens that things seem to work, probably because the zlib
> > format does have enough synchronization in it to not try to continue past
> > the end _anyway_, but I think this makes the patch be of debatable value.
>
> I thought the fact we do check the status with Z_STREAM_END means that we
> do already expect and rely on zlib to know where the end of input stream
> is, and stop there (otherwise we say something fishy is going on and we
> error out), and it was part of the design, not just "so happens" and "has
> enough synch ... _anyway_".
>
> If input zlib stream were corrupted and it detected the end of stream too
> early, then check of "stream.total_out != size" would fail even though we
> would see "st == Z_STREAM_END". If input stream were corrupted and it
> went past the end marker, we will read past the end and into some garbage
> that is the in-pack header of the next object representation, but zlib
> shouldn't go berserk even in that case, and would stop after filling the
> slop you allocated in the buffer --- we would detect the situation from
> stream.total_out != size and most likely st != Z_STREAM_END in such a
> case.
Unless I'm missing something, I think your analysis is right and
everything should be safe.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 17:11 ` Nicolas Pitre
@ 2008-09-07 17:41 ` Junio C Hamano
0 siblings, 0 replies; 80+ messages in thread
From: Junio C Hamano @ 2008-09-07 17:41 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Linus Torvalds, Shawn O. Pearce, Geert Bosch, Andi Kleen,
Ken Pratt, git
Nicolas Pitre <nico@cam.org> writes:
> On Sat, 6 Sep 2008, Junio C Hamano wrote:
>
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> ...
>> > Which it does - but the patch kind of violates that whole design.
>> >
>> > Now, it so happens that things seem to work, probably because the zlib
>> > format does have enough synchronization in it to not try to continue past
>> > the end _anyway_, but I think this makes the patch be of debatable value.
>>
>> I thought the fact we do check the status with Z_STREAM_END means that we
>> do already expect and rely on zlib to know where the end of input stream
>> is, and stop there (otherwise we say something fishy is going on and we
>> error out), and it was part of the design, not just "so happens" and "has
>> enough synch ... _anyway_".
>>
>> If input zlib stream were corrupted and it detected the end of stream too
>> early, then check of "stream.total_out != size" would fail even though we
>> would see "st == Z_STREAM_END". If input stream were corrupted and it
>> went past the end marker, we will read past the end and into some garbage
>> that is the in-pack header of the next object representation, but zlib
>> shouldn't go berserk even in that case, and would stop after filling the
>> slop you allocated in the buffer --- we would detect the situation from
>> stream.total_out != size and most likely st != Z_STREAM_END in such a
>> case.
>
> Unless I'm missing something, I think your analysis is right and
> everything should be safe.
I obviously agree with you, but what I forgot to mention above is that we
also make sure stream.avail_in is set so as not to overrun the end of the
current pack window (or the entire mmapped loose object data).
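As an aside, that clamping amounts to something like the sketch below
(names are hypothetical, not git's actual window code): zlib is only ever
told about the bytes that remain inside the currently mapped region, so
even a corrupt stream cannot make it read past the mapping.

#include <limits.h>
#include <stddef.h>

/* Hand zlib (via stream.avail_in) at most what remains of the mapped
 * window, capped to what a uInt can represent. */
static unsigned int clamp_avail_in(const unsigned char *cur,
                                   const unsigned char *window_end)
{
    size_t left = (size_t)(window_end - cur);
    return left > UINT_MAX ? UINT_MAX : (unsigned int)left;
}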
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 17:08 ` Nicolas Pitre
@ 2008-09-07 20:33 ` Jon Smirl
2008-09-08 14:17 ` Nicolas Pitre
0 siblings, 1 reply; 80+ messages in thread
From: Jon Smirl @ 2008-09-07 20:33 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On 9/7/08, Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 7 Sep 2008, Jon Smirl wrote:
>
> > On 9/7/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > >
> > > On Sat, 6 Sep 2008, Jon Smirl wrote:
> > > >
> > >
> > > > When I was playing with those giant Mozilla packs speed of zlib wasn't
> > > > a big problem. Number one problem was the repack process exceeding 3GB
> > > > which forced me to get 64b hardware and 8GB of memory. If you start
> > > > swapping in a repack, kill it, it will probably take a month to
> > > > finish.
> > >
> > >
> > > .. and you'd make things much much WORSE?
> >
> > My observations on the Mozilla packs indicated that the problems were
> > elsewhere in git, not in the decompression algorithms. Why does a
> > single 2000 delta chain take 15% of the entire pack time? Something
> > isn't right when long chains are processed which triggers far more
> > decompressions than needed.
>
>
> Please have a look at commit eac12e2d4d7f. This fix improved things for
> my gcc repack tests.
Do you have any test numbers for something like a 2000 delta chain
before and after?
You can get to Mozilla CVS with rsync.
https://wiki.mozilla.org/How_to_Create_a_CVS_Mirror
I think it was the master Mozilla makefile with the 2000 deltas.
The whole repo is 15GB, so you probably just want the Makefile,v file.
There's no point in working with Mozilla except for testing purposes,
since they went with Mercurial and abandoned their history.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-07 20:33 ` Jon Smirl
@ 2008-09-08 14:17 ` Nicolas Pitre
2008-09-08 15:12 ` Jon Smirl
0 siblings, 1 reply; 80+ messages in thread
From: Nicolas Pitre @ 2008-09-08 14:17 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On Sun, 7 Sep 2008, Jon Smirl wrote:
> On 9/7/08, Nicolas Pitre <nico@cam.org> wrote:
> > Please have a look at commit eac12e2d4d7f. This fix improved things for
> > my gcc repack tests.
>
> Do you have any test numbers for something like a 2000 delta chain
> before and after?
What kind of number do you want?
Before that change I wasn't able to repack an already tightly packed
(about 340MB) gcc repository on my machine, while the same repository,
sparsely packed (3GB or so), could be repacked just fine.
> You can get to Mozilla CVS with rsync.
> https://wiki.mozilla.org/How_to_Create_a_CVS_Mirror
> I think it was the master Mozilla makefile with the 2000 deltas.
> The whole repo is 15GB so you probably just want the Makefile,v
I have a test Mozilla repo dating back to the time you were playing with
it too (I think). Its directory date is 2007-04-12. It was quite
tightly packed already, but I just ran a "git repack -a -d -f
--window=100 --depth=2000" on it and now have a 380MB pack file for it.
Nicolas
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-08 14:17 ` Nicolas Pitre
@ 2008-09-08 15:12 ` Jon Smirl
2008-09-08 16:01 ` Jon Smirl
0 siblings, 1 reply; 80+ messages in thread
From: Jon Smirl @ 2008-09-08 15:12 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On 9/8/08, Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 7 Sep 2008, Jon Smirl wrote:
>
> > On 9/7/08, Nicolas Pitre <nico@cam.org> wrote:
>
> > > Please have a look at commit eac12e2d4d7f. This fix improved things for
> > > my gcc repack tests.
> >
> > Do you have any test numbers for something like a 2000 delta chain
> > before and after?
>
>
> What kind of number do you want?
See if repacking a 2000-delta chain still takes 30 minutes. It can be
any 2000-delta chain.
> Before that change I wasn't able to repack an already tightly packed
> (about 340MB) gcc repository on my machine while the same but sparsely
> packed (3GB or so) repository could be repacked just fine.
>
>
> > You can get to Mozilla CVS with rsync.
> > https://wiki.mozilla.org/How_to_Create_a_CVS_Mirror
> > I think it was the master Mozilla makefile with the 2000 deltas.
> > The whole repo is 15GB so you probably just want the Makefile,v
>
>
> I have a test Mozilla repo dating back to the time you were playing with
> it too (I think). Its directory date is 2007-04-12. It was quite
> tightly packed already, but I just ran a "git repack -a -d -f
> --window=100 --depth=2000" on it and now have a 380MB pack file for it.
>
>
>
> Nicolas
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: pack operation is thrashing my server
2008-09-08 15:12 ` Jon Smirl
@ 2008-09-08 16:01 ` Jon Smirl
0 siblings, 0 replies; 80+ messages in thread
From: Jon Smirl @ 2008-09-08 16:01 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On 9/8/08, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 9/8/08, Nicolas Pitre <nico@cam.org> wrote:
> > On Sun, 7 Sep 2008, Jon Smirl wrote:
> >
> > > On 9/7/08, Nicolas Pitre <nico@cam.org> wrote:
> >
> > > > Please have a look at commit eac12e2d4d7f. This fix improved things for
> > > > my gcc repack tests.
> > >
> > > Do you have any test numbers for something like a 2000 delta chain
> > > before and after?
> >
> >
> > What kind of number do you want?
>
>
> See if repacking a 2000 chain delta still takes 30 minutes. It can be
> any 2000 chain delta.
The time to repack a 2000-delta chain would be a good thing to monitor
as part of the testing process. It amplifies any small performance
problems and makes them obvious.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 80+ messages in thread
end of thread
Thread overview: 80+ messages
2008-08-10 19:47 pack operation is thrashing my server Ken Pratt
2008-08-10 23:06 ` Martin Langhoff
2008-08-10 23:12 ` Ken Pratt
2008-08-10 23:30 ` Martin Langhoff
2008-08-10 23:34 ` Ken Pratt
2008-08-11 3:04 ` Shawn O. Pearce
2008-08-11 7:43 ` Ken Pratt
2008-08-11 15:01 ` Shawn O. Pearce
2008-08-11 15:40 ` Avery Pennarun
2008-08-11 15:59 ` Shawn O. Pearce
2008-08-11 19:13 ` Ken Pratt
2008-08-11 19:10 ` Andi Kleen
2008-08-11 19:15 ` Ken Pratt
2008-08-13 2:38 ` Nicolas Pitre
2008-08-13 2:50 ` Andi Kleen
2008-08-13 2:57 ` Shawn O. Pearce
2008-08-11 19:22 ` Shawn O. Pearce
2008-08-11 19:29 ` Ken Pratt
2008-08-11 19:34 ` Shawn O. Pearce
2008-08-11 20:10 ` Andi Kleen
2008-08-13 3:12 ` Geert Bosch
2008-08-13 3:15 ` Shawn O. Pearce
2008-08-13 3:58 ` Geert Bosch
2008-08-13 14:37 ` Nicolas Pitre
2008-08-13 14:56 ` Jakub Narebski
2008-08-13 15:04 ` Shawn O. Pearce
2008-08-13 15:26 ` David Tweed
2008-08-13 23:54 ` Martin Langhoff
2008-08-14 9:04 ` David Tweed
2008-08-13 16:10 ` Johan Herland
2008-08-13 17:38 ` Ken Pratt
2008-08-13 17:57 ` Nicolas Pitre
2008-08-13 14:35 ` Nicolas Pitre
2008-08-13 14:59 ` Shawn O. Pearce
2008-08-13 15:43 ` Nicolas Pitre
2008-08-13 15:50 ` Shawn O. Pearce
2008-08-13 17:04 ` Nicolas Pitre
2008-08-13 17:19 ` Shawn O. Pearce
2008-08-14 6:33 ` Andreas Ericsson
2008-08-14 10:04 ` Thomas Rast
2008-08-14 10:15 ` Andreas Ericsson
2008-08-14 22:33 ` Shawn O. Pearce
2008-08-15 1:46 ` Nicolas Pitre
2008-08-14 14:01 ` Nicolas Pitre
2008-08-14 17:21 ` Linus Torvalds
2008-08-14 17:58 ` Linus Torvalds
2008-08-14 19:04 ` Nicolas Pitre
2008-08-14 19:44 ` Linus Torvalds
2008-08-14 21:30 ` Andi Kleen
2008-08-15 16:15 ` Linus Torvalds
2008-08-14 21:50 ` Nicolas Pitre
2008-08-14 23:14 ` Linus Torvalds
2008-08-14 23:39 ` Björn Steinbrink
2008-08-15 0:06 ` Linus Torvalds
2008-08-15 0:25 ` Linus Torvalds
2008-08-16 12:47 ` Björn Steinbrink
2008-08-16 0:34 ` Linus Torvalds
2008-09-07 1:03 ` Junio C Hamano
2008-09-07 1:46 ` Linus Torvalds
2008-09-07 2:33 ` Junio C Hamano
2008-09-07 17:11 ` Nicolas Pitre
2008-09-07 17:41 ` Junio C Hamano
2008-09-07 2:50 ` Jon Smirl
2008-09-07 3:07 ` Linus Torvalds
2008-09-07 3:43 ` Jon Smirl
2008-09-07 4:50 ` Linus Torvalds
2008-09-07 13:58 ` Jon Smirl
2008-09-07 17:08 ` Nicolas Pitre
2008-09-07 20:33 ` Jon Smirl
2008-09-08 14:17 ` Nicolas Pitre
2008-09-08 15:12 ` Jon Smirl
2008-09-08 16:01 ` Jon Smirl
2008-09-07 8:18 ` Andreas Ericsson
2008-09-07 7:45 ` Mike Hommey
2008-08-14 18:38 ` Nicolas Pitre
2008-08-14 18:55 ` Linus Torvalds
2008-08-13 16:01 ` Geert Bosch
2008-08-13 17:13 ` Dana How
2008-08-13 17:26 ` Nicolas Pitre
2008-08-13 12:43 ` Jakub Narebski