* [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Jonathan Nieder <jrnieder@gmail.com>
Date: 2010-11-23 7:53 UTC
To: git
Cc: Shawn O. Pearce, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

From: David Barr <david.barr@cordelta.com>

More often than not, find_object is called for recently inserted objects.
Optimise for this case by inserting new entries at the start of the chain.
This doesn't affect the cost of new inserts but reduces the cost of find
and insert for existing object entries.

Signed-off-by: David Barr <david.barr@cordelta.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Hi,

After importing the first 500000 revs or so of the ASF repo with
svn-fe, it seems that fast-import gets a bit sluggish because its
object table fills up.  David noticed that one can get a bit of a
speed-up by noticing that objects are likely to be accessed soon after
they are inserted in this workflow.  (The basis for most svn deltas
comes from the revision just dumped.)

I would guess other workflows would display the same tendency, if any;
at least, the same blob is likely to come up repeatedly in a few revs
and then not be mentioned for a while.  Though maybe there is some
reason to make the opposite assumption that I have not thought of.
Shawn?

A more dramatic improvement comes from increasing the size of the hash
table; in particular, David noticed it allows the CPU usage to
increase from ~60% to 100% of one core.  So presumably we should make
the size of the hash table dynamic.

Other aspects to investigate: choice of hash function; is it worth
moving accessed entries to the front of hash chains when they already
exist?  Timings in interesting workflows would also be nice.

Regardless, this seems like a good change on its own, especially
because it simplifies the code.

Thoughts?
Jonathan

 fast-import.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 77549eb..e2f4874 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -539,22 +539,17 @@ static struct object_entry *insert_object(unsigned char *sha1)
 {
 	unsigned int h = sha1[0] << 8 | sha1[1];
 	struct object_entry *e = object_table[h];
-	struct object_entry *p = NULL;
 
 	while (e) {
 		if (!hashcmp(sha1, e->idx.sha1))
 			return e;
-		p = e;
 		e = e->next;
 	}
 
 	e = new_object(sha1);
-	e->next = NULL;
+	e->next = object_table[h];
 	e->idx.offset = 0;
-	if (p)
-		p->next = e;
-	else
-		object_table[h] = e;
+	object_table[h] = e;
 	return e;
 }
-- 
1.7.2.3
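[Editor's note: the pattern in the patch above can be modelled in isolation.
Below is a standalone sketch of a chained hash table with head insertion;
the names, key length, and table size are illustrative, not fast-import's
actual code.]

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of fast-import's object_table: a fixed-size array of
 * singly-linked bucket chains.  New entries go at the HEAD of their
 * bucket (the behaviour the patch introduces), so recently inserted
 * keys are found after the fewest comparisons. */
#define TABLE_SIZE (1u << 16)
#define KEY_LEN 20

struct entry {
	unsigned char key[KEY_LEN];
	struct entry *next;
};

static struct entry *table[TABLE_SIZE];

static struct entry *insert_entry(const unsigned char *key)
{
	unsigned int h = (key[0] << 8 | key[1]) & (TABLE_SIZE - 1);
	struct entry *e;

	for (e = table[h]; e; e = e->next)
		if (!memcmp(key, e->key, KEY_LEN))
			return e;	/* already present: find hit */

	e = malloc(sizeof(*e));
	memcpy(e->key, key, KEY_LEN);
	e->next = table[h];	/* head insertion: no tail pointer needed */
	table[h] = e;
	return e;
}
```

Note that the insert cost is unchanged (the chain is still walked once to
rule out a duplicate); only the position of the new entry differs.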
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Sverre Rabbelier
Date: 2010-11-23 12:51 UTC
To: Jonathan Nieder
Cc: git, Shawn O. Pearce, David Barr, Nicolas Pitre, Raja R Harinath

Heya,

On Tue, Nov 23, 2010 at 08:53, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Thoughts?

Are there (m)any non-artificial cases for which this slows things down
(a lot)?

-- 
Cheers,

Sverre Rabbelier
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Jonathan Nieder
Date: 2010-11-23 18:19 UTC
To: Sverre Rabbelier
Cc: git, Shawn O. Pearce, David Barr, Nicolas Pitre, Raja R Harinath

Sverre Rabbelier wrote:
> On Tue, Nov 23, 2010 at 08:53, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Thoughts?
>
> Are there (m)any non-artificial cases for which this slows things down
> (a lot)?

If a very common blob (e.g., the empty blob) is in the same bucket as
a few uncommon ones, it is likely to be inserted first and this would
slow down access to it.
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Shawn Pearce
Date: 2010-11-23 22:33 UTC
To: Jonathan Nieder
Cc: git, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

On Mon, Nov 22, 2010 at 11:53 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> More often than not, find_object is called for recently inserted objects.
> Optimise for this case by inserting new entries at the start of the chain.
> This doesn't affect the cost of new inserts but reduces the cost of find
> and insert for existing object entries.

This is a good change.  I totally agree with it.

> I would guess other workflows would display the same tendency if any;
> at least, the same blob is likely to come up repeatedly in a few revs
> and then not be mentioned for a while.  Though maybe there is some
> reason to make the opposite assumption that I have not thought of.
> Shawn?

The new entry was put onto the end of the hash chain because I wasn't
thinking when I wrote that code.  For some reason I assumed it made
sense to put it onto the end of the chain, because I had already walked
the chain to see if it already exists... and by the time I reached the
end, I had the tail, so I might as well put the new item onto the tail.

Moving the new item to the head works because it is somewhat likely
that the new item will be asked for in the near future, and older items
are less likely to be asked for.  So I can see how this can offer a
small CPU time improvement for a very big import.

> A more dramatic improvement comes from increasing the size of the hash
> table; in particular, David noticed it allows the CPU usage to
> increase from ~60% to 100% of one core.
> So presumably we should make the size of the hash table dynamic.

Uhm.  Wait, the table isn't already dynamic?  I cheated that badly and
made the table fixed size?

I can tell you why it's fixed size... when I wrote fast-import to
support the Mozilla repository import, we had an estimate on the number
of objects that we needed fast-import to handle in a given run.  From
that estimate we concluded that a table of 2^16 was sufficiently large,
as the hash chains would only be some small N long given the total
number of objects we needed to handle for Mozilla.  Doubling that into
2^17 or larger wasn't useful, and using a smaller table like 2^14
produced too long a chain.  Once this code was working, we moved on to
other features, and never reconsidered the table size.

This table should be growing if the initial 2^16 isn't big enough.
It's just that nobody has ever noticed that I hardcoded the size.  :-)

> Other aspects to investigate: choice of hash function;

Why?  SHA-1 is pretty uniform in its distribution.  The object_table
is taking the first 16 bits and using that to index the object; that's
the first 4 hex digits.  The Linux kernel uniquely identifies an
object with under 10 hex digits, or 40 leading bits.

The table just needs to double or quadruple its size (and have the
hash function include that many more bits) when an arbitrary bucket's
chain length gets over some reasonable constant number (e.g. 16).  We
probably do need to cap the table size though, especially on 32 bit
builds of Git.

> is it worth moving accessed entries to the front of hash chains when
> they already exist?

Maybe, maybe not?  It depends on what the front end is doing.  If the
front end is feeding hex SHA-1s, and will frequently reuse the same
blobs, yes.  Otherwise, maybe not.
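[Editor's note: the growth policy sketched in the message above — widen the
table by a bit or two of the hash once some bucket's chain passes a small
constant — could look roughly like the following.  All structure and
function names are hypothetical, not fast-import's actual code.]

```c
#include <stdlib.h>

/* A chained hash table that doubles itself (taking one more bit of the
 * hash for the bucket index) whenever the chain being inserted into is
 * already MAX_CHAIN entries long. */
#define MAX_CHAIN 16

struct entry {
	unsigned int hash;	/* full hash; bucket = hash & (size - 1) */
	struct entry *next;
};

struct table {
	struct entry **buckets;
	unsigned int size;	/* always a power of two */
};

static void grow(struct table *t)
{
	unsigned int new_size = t->size * 2;
	struct entry **nb = calloc(new_size, sizeof(*nb));
	unsigned int i;

	for (i = 0; i < t->size; i++) {
		struct entry *e = t->buckets[i];
		while (e) {
			struct entry *next = e->next;
			unsigned int b = e->hash & (new_size - 1);
			e->next = nb[b];	/* rehash into the wider table */
			nb[b] = e;
			e = next;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->size = new_size;
}

static void insert(struct table *t, unsigned int hash)
{
	unsigned int b = hash & (t->size - 1);
	unsigned int chain = 0;
	struct entry *e;

	for (e = t->buckets[b]; e; e = e->next)
		chain++;
	if (chain >= MAX_CHAIN)
		grow(t);

	b = hash & (t->size - 1);	/* recompute; size may have changed */
	e = malloc(sizeof(*e));
	e->hash = hash;
	e->next = t->buckets[b];	/* still head insertion, as in the patch */
	t->buckets[b] = e;
}
```

A real version would also cap new_size, as Shawn suggests for 32-bit builds.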
The only time we are looking for an object in the object_table is when
we are about to insert an object and need to avoid a duplicate (that's
the insert_object code path), or when we have to page a tree into the
branch cache.  If the branch cache is spilling, you probably should be
increasing its size so that it doesn't spill as often.

The code within store_tree() is really inefficient.  It looks up the
old tree to find out how it's stored... we could store that
struct object_entry* inside of the struct tree_content.  We also could
avoid calling store_object() if the tree hasn't been modified.  The
way I read this store_tree() code, every subdirectory is recursed into
even if no modifications were made inside of that subdirectory during
the current commit.  We just need to format the tree, and if the SHA-1
is the same as the old SHA-1, skip back out without calling
store_object().

That should give you a pretty dramatic reduction in the number of
times the object_table is accessed during an import.

-- 
Shawn.
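[Editor's note: the short-circuit Shawn proposes — format the tree, compare
its hash against the previously stored version, and skip the store when they
match — reduces to the following shape.  hash_buf() is a stand-in for git's
real SHA-1 hashing and everything here is illustrative, not fast-import's
store_tree().]

```c
/* Counts calls to the (stubbed) object store, so the effect of the
 * short-circuit is observable. */
static unsigned int stores;

/* Stand-in hash (djb2) for git's SHA-1 over the formatted tree. */
static unsigned int hash_buf(const char *buf)
{
	unsigned int h = 5381;
	while (*buf)
		h = h * 33 + (unsigned char)*buf++;
	return h;
}

static void store_object(const char *buf)
{
	(void)buf;
	stores++;	/* in fast-import this would write to the pack */
}

/* Format result in hand: only store when the content actually changed
 * relative to the previously recorded hash. */
static unsigned int store_tree(const char *formatted, unsigned int old_hash)
{
	unsigned int h = hash_buf(formatted);
	if (h != old_hash)
		store_object(formatted);
	return h;
}
```

An unmodified subtree then costs one hash and one compare instead of a full
store_object() call and object_table lookup.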
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Jonathan Nieder
Date: 2010-11-23 23:17 UTC
To: Shawn Pearce
Cc: git, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

Shawn Pearce wrote:
> On Mon, Nov 22, 2010 at 11:53 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Other aspects to investigate: choice of hash function;
>
> Why?  SHA-1 is pretty uniform in its distribution.

I got distracted for a moment by the atom table, but since that does
not have a big effect on performance it's probably not worth spending
time on.  Sorry about that; please ignore.

[...]
> The way I read this store_tree() code, every subdirectory is recursed
> into even if no modifications were made inside of that subdirectory
> during the current commit.

Doesn't the is_null_sha1 check avoid that?

To further explain the workload: svn-fe receives its blobs from svn
in the form of deltas.  So the conversation might go like this:

 S  commit refs/heads/master
 S  mark :10000
 S  committer felicity <felicity@local>
 S  data 74
 S  bug 3097: switch spamd from doing 'fork per message' to a 'prefork' model
 S  cat incubator/spamassassin/trunk/spamd/spamd.raw
 F  89d56462577b8b7b4f4115f2a47f0b3da22b791a blob 63633
 F  #!/usr/bin/perl -w -T
 ...
 S  M 100644 inline incubator/spamassassin/trunk/spamd/spamd.raw
 S  data 62114
 ...

Current svn-fe in vcs-svn-pu requests the preimage blobs using marks,
but the idea is the same.

If this proves a bottleneck I suppose we could cache the content of
frequently-requested old blobs and keep pointers to that in the
in-core tree.
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Shawn Pearce
Date: 2010-11-23 23:29 UTC
To: Jonathan Nieder
Cc: git, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

On Tue, Nov 23, 2010 at 3:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>> The way I read this store_tree() code, every subdirectory is recursed
>> into even if no modifications were made inside of that subdirectory
>> during the current commit.
>
> Doesn't the is_null_sha1 check avoid that?

Oh, right, yes.  That check avoids processing a tree that wasn't
modified.  Damn, that code isn't very clear.  :-(

> To further explain the workload: svn-fe receives its blobs from svn
> in the form of deltas.  So the conversation might go like this:
...
> Current svn-fe in vcs-svn-pu requests the preimage blobs using marks,
> but the idea is the same.
>
> If this proves a bottleneck I suppose we could cache the content of
> frequently-requested old blobs and keep pointers to that in the
> in-core tree.

OK, now I get it.

fast-import already uses sha1_file.c's delta base cache for the delta
bases that are frequently accessed.  It *may* help an importer that
needs to keep refetching the same blob to cache the blob itself rather
than its delta base (and reapply every time).  It might not help too;
it may hurt to try and keep recently accessed objects that the
importer won't need to request again.

-- 
Shawn.