From: Linus Torvalds <torvalds@linux-foundation.org>
To: Nicolas Pitre <nico@cam.org>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
Geert Bosch <bosch@adacore.com>, Andi Kleen <andi@firstfloor.org>,
Ken Pratt <ken@kenpratt.net>,
git@vger.kernel.org
Subject: Re: pack operation is thrashing my server
Date: Thu, 14 Aug 2008 10:58:01 -0700 (PDT) [thread overview]
Message-ID: <alpine.LFD.1.10.0808141022500.3324@nehalem.linux-foundation.org> (raw)
In-Reply-To: <alpine.LFD.1.10.0808141014410.3324@nehalem.linux-foundation.org>
On Thu, 14 Aug 2008, Linus Torvalds wrote:
>
> Doing a rev-list of all objects is a fairly rare operation, but even if
> you want to clone/repack all of your archives the whole time, please
> realize that listing objects is _not_ a simple operation. It opens up and
> parses every single tree in the whole history. That's a _lot_ of data to
> unpack.
Btw, it's not that hard to run oprofile (link git statically to get better
numbers). For me, the answer to what is going on for a kernel rev-list is
pretty straightforward:
263742 26.6009 lookup_object
135945 13.7113 inflate
110525 11.1475 inflate_fast
75124 7.5770 inflate_table
64676 6.5232 strlen
48635 4.9053 memcpy
47744 4.8154 find_pack_entry_one
35265 3.5568 _int_malloc
31579 3.1850 decode_tree_entry
28388 2.8632 adler32
19441 1.9608 process_tree
10398 1.0487 patch_delta
8925 0.9002 _int_free
..
so most of it is in inflate, but I suspect the cost of "lookup_object()"
is so high becuase when we parse the trees we also have to look up every
blob - even if they didn't change - just to see whether we already saw it
or not.
For me, an instruction-level profile of lookup_object() shows that the
cost is all in the hashcmp (53% of the profile is on that "repz cmpsb")
and in the loading of the object pointer (26% of the profile is on the
test instruction after the "obj_hash[i]" load). I don't think we can
really improve that code much - the hash table is very efficient, and the
cost is just in the fact that we have a lot of meory accesses.
We could try to use the (more memory-hungry) "hash.c" implementation for
object hashing, which actually includes a 32-bit key inside the hash
table, but while that will avoid the cost of fetching the object pointer
for the cases where we have collisions, most of the time the cost is not
in the collision, but in the fact that we _hit_.
I bet the hit percentage is 90+%, and the cost really is just that we
encounter the same object hundreds or thousands of times.
Please realize that even if there may be "only" a million objects in the
kernel, there are *MANY* more ways to _reach_ those objects, and that is
what git-rev-list --objects does! It's not O(number-of-objects), it's
O(number-of-object-linkages).
For my current kernel archive, for example, the number of objects is
roughly 900k. However, think about how many times we'll actually reach a
blob: that's roughly (blobs per commit)*(number of commits), which can be
approximated with
echo $(( $(git ls-files | wc -l) * $(git rev-list --all | wc -l) ))
which is 24324*108518=2639591832 ie about 2.5 _billion_ times.
Now, we don't actually do anything close to that many lookups, because
when a subdirectory doesn't change at all, we'll skip the whole tree after
having seen it just once, so that will cut down on the number of objects
we have to look up by probably a couple of orders of magnitude.
But this is why the "one large directory" load performs worse: in the
worst case, if you really have a totally flat directory tree, you'd
literally see that 2.5 billion object lookup case.
So it's not that git scales badly. It's that "git rev-list --objects" is
really a very expensive operation, and while some good practices (deep
directory structures) makes it able to optimize the load away a lot, it's
still potentially very tough.
Linus
next prev parent reply other threads:[~2008-08-14 17:59 UTC|newest]
Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-08-10 19:47 pack operation is thrashing my server Ken Pratt
2008-08-10 23:06 ` Martin Langhoff
2008-08-10 23:12 ` Ken Pratt
2008-08-10 23:30 ` Martin Langhoff
2008-08-10 23:34 ` Ken Pratt
2008-08-11 3:04 ` Shawn O. Pearce
2008-08-11 7:43 ` Ken Pratt
2008-08-11 15:01 ` Shawn O. Pearce
2008-08-11 15:40 ` Avery Pennarun
2008-08-11 15:59 ` Shawn O. Pearce
2008-08-11 19:13 ` Ken Pratt
2008-08-11 19:10 ` Andi Kleen
2008-08-11 19:15 ` Ken Pratt
2008-08-13 2:38 ` Nicolas Pitre
2008-08-13 2:50 ` Andi Kleen
2008-08-13 2:57 ` Shawn O. Pearce
2008-08-11 19:22 ` Shawn O. Pearce
2008-08-11 19:29 ` Ken Pratt
2008-08-11 19:34 ` Shawn O. Pearce
2008-08-11 20:10 ` Andi Kleen
2008-08-13 3:12 ` Geert Bosch
2008-08-13 3:15 ` Shawn O. Pearce
2008-08-13 3:58 ` Geert Bosch
2008-08-13 14:37 ` Nicolas Pitre
2008-08-13 14:56 ` Jakub Narebski
2008-08-13 15:04 ` Shawn O. Pearce
2008-08-13 15:26 ` David Tweed
2008-08-13 23:54 ` Martin Langhoff
2008-08-14 9:04 ` David Tweed
2008-08-13 16:10 ` Johan Herland
2008-08-13 17:38 ` Ken Pratt
2008-08-13 17:57 ` Nicolas Pitre
2008-08-13 14:35 ` Nicolas Pitre
2008-08-13 14:59 ` Shawn O. Pearce
2008-08-13 15:43 ` Nicolas Pitre
2008-08-13 15:50 ` Shawn O. Pearce
2008-08-13 17:04 ` Nicolas Pitre
2008-08-13 17:19 ` Shawn O. Pearce
2008-08-14 6:33 ` Andreas Ericsson
2008-08-14 10:04 ` Thomas Rast
2008-08-14 10:15 ` Andreas Ericsson
2008-08-14 22:33 ` Shawn O. Pearce
2008-08-15 1:46 ` Nicolas Pitre
2008-08-14 14:01 ` Nicolas Pitre
2008-08-14 17:21 ` Linus Torvalds
2008-08-14 17:58 ` Linus Torvalds [this message]
2008-08-14 19:04 ` Nicolas Pitre
2008-08-14 19:44 ` Linus Torvalds
2008-08-14 21:30 ` Andi Kleen
2008-08-15 16:15 ` Linus Torvalds
2008-08-14 21:50 ` Nicolas Pitre
2008-08-14 23:14 ` Linus Torvalds
2008-08-14 23:39 ` Björn Steinbrink
2008-08-15 0:06 ` Linus Torvalds
2008-08-15 0:25 ` Linus Torvalds
2008-08-16 12:47 ` Björn Steinbrink
2008-08-16 0:34 ` Linus Torvalds
2008-09-07 1:03 ` Junio C Hamano
2008-09-07 1:46 ` Linus Torvalds
2008-09-07 2:33 ` Junio C Hamano
2008-09-07 17:11 ` Nicolas Pitre
2008-09-07 17:41 ` Junio C Hamano
2008-09-07 2:50 ` Jon Smirl
2008-09-07 3:07 ` Linus Torvalds
2008-09-07 3:43 ` Jon Smirl
2008-09-07 4:50 ` Linus Torvalds
2008-09-07 13:58 ` Jon Smirl
2008-09-07 17:08 ` Nicolas Pitre
2008-09-07 20:33 ` Jon Smirl
2008-09-08 14:17 ` Nicolas Pitre
2008-09-08 15:12 ` Jon Smirl
2008-09-08 16:01 ` Jon Smirl
2008-09-07 8:18 ` Andreas Ericsson
2008-09-07 7:45 ` Mike Hommey
2008-08-14 18:38 ` Nicolas Pitre
2008-08-14 18:55 ` Linus Torvalds
2008-08-13 16:01 ` Geert Bosch
2008-08-13 17:13 ` Dana How
2008-08-13 17:26 ` Nicolas Pitre
2008-08-13 12:43 ` Jakub Narebski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.1.10.0808141022500.3324@nehalem.linux-foundation.org \
--to=torvalds@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=bosch@adacore.com \
--cc=git@vger.kernel.org \
--cc=ken@kenpratt.net \
--cc=nico@cam.org \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).