* [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Jonathan Nieder <jrnieder@gmail.com>
Date: 2010-11-23 7:53 UTC
To: git
Cc: Shawn O. Pearce, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

From: David Barr <david.barr@cordelta.com>

More often than not, find_object is called for recently inserted objects.
Optimise for this case by inserting new entries at the start of the chain.
This doesn't affect the cost of new inserts but reduces the cost of find
and insert for existing object entries.

Signed-off-by: David Barr <david.barr@cordelta.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Hi,

After importing the first 500000 revs or so of the ASF repo with
svn-fe, it seems that fast-import gets a bit sluggish because its
object table fills up.  David noticed that one can get a bit of a
speed-up by noticing that objects are likely to be accessed soon after
they are inserted in this workflow.  (The basis for most svn deltas
comes from the revision just dumped.)

I would guess other workflows would display the same tendency, if any;
at least, the same blob is likely to come up repeatedly in a few revs
and then not be mentioned for a while.  Though maybe there is some
reason to make the opposite assumption that I have not thought of.
Shawn?

A more dramatic improvement comes from increasing the size of the hash
table; in particular, David noticed it allows the CPU usage to
increase from ~60% to 100% of one core.  So presumably we should make
the size of the hash table dynamic.

Other aspects to investigate: choice of hash function; is it worth
moving accessed entries to the front of hash chains when they already
exist?  Timings in interesting workflows would also be nice.

Regardless, this seems like a good change on its own, especially
because it simplifies the code.

Thoughts?
Jonathan

 fast-import.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 77549eb..e2f4874 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -539,22 +539,17 @@ static struct object_entry *insert_object(unsigned char *sha1)
 {
 	unsigned int h = sha1[0] << 8 | sha1[1];
 	struct object_entry *e = object_table[h];
-	struct object_entry *p = NULL;
 
 	while (e) {
 		if (!hashcmp(sha1, e->idx.sha1))
 			return e;
-		p = e;
 		e = e->next;
 	}
 
 	e = new_object(sha1);
-	e->next = NULL;
+	e->next = object_table[h];
 	e->idx.offset = 0;
-	if (p)
-		p->next = e;
-	else
-		object_table[h] = e;
+	object_table[h] = e;
 	return e;
 }
-- 
1.7.2.3
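[Editor's note: the pattern in the patch above can be modelled in isolation.
Below is a standalone sketch of a chained hash table with head insertion;
the names, key length, and table size are illustrative, not fast-import's
actual code.]

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of fast-import's object_table: a fixed-size array of
 * singly-linked bucket chains.  New entries go at the HEAD of their
 * bucket (the behaviour the patch introduces), so recently inserted
 * keys are found after the fewest comparisons. */
#define TABLE_SIZE (1u << 16)
#define KEY_LEN 20

struct entry {
	unsigned char key[KEY_LEN];
	struct entry *next;
};

static struct entry *table[TABLE_SIZE];

static struct entry *insert_entry(const unsigned char *key)
{
	unsigned int h = (key[0] << 8 | key[1]) & (TABLE_SIZE - 1);
	struct entry *e;

	for (e = table[h]; e; e = e->next)
		if (!memcmp(key, e->key, KEY_LEN))
			return e;	/* already present: find hit */

	e = malloc(sizeof(*e));
	memcpy(e->key, key, KEY_LEN);
	e->next = table[h];	/* head insertion: no tail pointer needed */
	table[h] = e;
	return e;
}
```

Note that the insert cost is unchanged (the chain is still walked once to
rule out a duplicate); only the position of the new entry differs.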
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Sverre Rabbelier
Date: 2010-11-23 12:51 UTC
To: Jonathan Nieder
Cc: git, Shawn O. Pearce, David Barr, Nicolas Pitre, Raja R Harinath

Heya,

On Tue, Nov 23, 2010 at 08:53, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Thoughts?

Are there (m)any non-artificial cases for which this slows things down
(a lot)?

-- 
Cheers,

Sverre Rabbelier
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Jonathan Nieder
Date: 2010-11-23 18:19 UTC
To: Sverre Rabbelier
Cc: git, Shawn O. Pearce, David Barr, Nicolas Pitre, Raja R Harinath

Sverre Rabbelier wrote:
> On Tue, Nov 23, 2010 at 08:53, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Thoughts?
>
> Are there (m)any non-artificial cases for which this slows things down
> (a lot)?

If a very common blob (e.g., the empty blob) is in the same bucket as
a few uncommon ones, it is likely to be inserted first and this would
slow down access to it.
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Shawn Pearce
Date: 2010-11-23 22:33 UTC
To: Jonathan Nieder
Cc: git, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

On Mon, Nov 22, 2010 at 11:53 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> More often than not, find_object is called for recently inserted objects.
> Optimise for this case by inserting new entries at the start of the chain.
> This doesn't affect the cost of new inserts but reduces the cost of find
> and insert for existing object entries.

This is a good change.  I totally agree with it.

> I would guess other workflows would display the same tendency if any;
> at least, the same blob is likely to come up repeatedly in a few revs
> and then not be mentioned for a while.  Though maybe there is some
> reason to make the opposite assumption that I have not thought of.
> Shawn?

The new entry was put onto the end of the hash chain because I wasn't
thinking when I wrote that code.  For some reason I assumed it made
sense to put it onto the end of the chain, because I had already walked
the chain to see if it already exists... and by the time I reached the
end, I had the tail, so I might as well put the new item onto the tail.

Moving the new item to the head works because it is somewhat likely
that the new item will be asked for in the near future, and older items
are less likely to be asked for.  So I can see how this can offer a
small CPU time improvement for a very big import.

> A more dramatic improvement comes from increasing the size of the hash
> table; in particular, David noticed it allows the CPU usage to
> increase from ~60% to 100% of one core.
> So presumably we should make the size of the hash table dynamic.

Uhm.  Wait, the table isn't already dynamic?  I cheated that badly and
made the table fixed size?

I can tell you why it's fixed size... when I wrote fast-import to
support the Mozilla repository import, we had an estimate on the number
of objects that we needed fast-import to handle in a given run.  From
that estimate we concluded that a table of 2^16 was sufficiently large,
as the hash chains would only be some small N long given the total
number of objects we needed to handle for Mozilla.  Doubling that into
2^17 or larger wasn't useful, and using a smaller table like 2^14
produced too long a chain.  Once this code was working, we moved on to
other features, and never reconsidered the table size.

This table should be growing if the initial 2^16 isn't big enough.
It's just that nobody has ever noticed that I hardcoded the size.  :-)

> Other aspects to investigate: choice of hash function;

Why?  SHA-1 is pretty uniform in its distribution.  The object_table
is taking the first 16 bits and using that to index the object; that's
the first 4 hex digits.  The Linux kernel uniquely identifies an
object with under 10 hex digits, or 40 leading bits.

The table just needs to double or quadruple its size (and have the
hash function include that many more bits) when an arbitrary bucket's
chain length gets over some reasonable constant number (e.g. 16).  We
probably do need to cap the table size though, especially on 32 bit
builds of Git.

> is it worth moving accessed entries to the front of hash chains when
> they already exist?

Maybe, maybe not?  It depends on what the front end is doing.  If the
front end is feeding hex SHA-1s, and will frequently reuse the same
blobs, yes.  Otherwise, maybe not.
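[Editor's note: the growth policy sketched in the message above — widen the
table by a bit or two of the hash once some bucket's chain passes a small
constant — could look roughly like the following.  All structure and
function names are hypothetical, not fast-import's actual code.]

```c
#include <stdlib.h>

/* A chained hash table that doubles itself (taking one more bit of the
 * hash for the bucket index) whenever the chain being inserted into is
 * already MAX_CHAIN entries long. */
#define MAX_CHAIN 16

struct entry {
	unsigned int hash;	/* full hash; bucket = hash & (size - 1) */
	struct entry *next;
};

struct table {
	struct entry **buckets;
	unsigned int size;	/* always a power of two */
};

static void grow(struct table *t)
{
	unsigned int new_size = t->size * 2;
	struct entry **nb = calloc(new_size, sizeof(*nb));
	unsigned int i;

	for (i = 0; i < t->size; i++) {
		struct entry *e = t->buckets[i];
		while (e) {
			struct entry *next = e->next;
			unsigned int b = e->hash & (new_size - 1);
			e->next = nb[b];	/* rehash into the wider table */
			nb[b] = e;
			e = next;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->size = new_size;
}

static void insert(struct table *t, unsigned int hash)
{
	unsigned int b = hash & (t->size - 1);
	unsigned int chain = 0;
	struct entry *e;

	for (e = t->buckets[b]; e; e = e->next)
		chain++;
	if (chain >= MAX_CHAIN)
		grow(t);

	b = hash & (t->size - 1);	/* recompute; size may have changed */
	e = malloc(sizeof(*e));
	e->hash = hash;
	e->next = t->buckets[b];	/* still head insertion, as in the patch */
	t->buckets[b] = e;
}
```

A real version would also cap new_size, as Shawn suggests for 32-bit builds.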
The only time we are looking for an object in the object_table is when
we are about to insert an object and need to avoid a duplicate (that's
the insert_object code path), or when we have to page a tree into the
branch cache.  If the branch cache is spilling, you probably should be
increasing its size so that it doesn't spill as often.

The code within store_tree() is really inefficient.  It looks up the
old tree to find out how it's stored... we could store that
struct object_entry* inside of the struct tree_content.  We also could
avoid calling store_object() if the tree hasn't been modified.  The
way I read this store_tree() code, every subdirectory is recursed into
even if no modifications were made inside of that subdirectory during
the current commit.  We just need to format the tree, and if the SHA-1
is the same as the old SHA-1, skip back out without calling
store_object().

That should give you a pretty dramatic reduction in the number of
times the object_table is accessed during an import.

-- 
Shawn.
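[Editor's note: the short-circuit Shawn proposes — format the tree, compare
its hash against the previously stored version, and skip the store when they
match — reduces to the following shape.  hash_buf() is a stand-in for git's
real SHA-1 hashing and everything here is illustrative, not fast-import's
store_tree().]

```c
/* Counts calls to the (stubbed) object store, so the effect of the
 * short-circuit is observable. */
static unsigned int stores;

/* Stand-in hash (djb2) for git's SHA-1 over the formatted tree. */
static unsigned int hash_buf(const char *buf)
{
	unsigned int h = 5381;
	while (*buf)
		h = h * 33 + (unsigned char)*buf++;
	return h;
}

static void store_object(const char *buf)
{
	(void)buf;
	stores++;	/* in fast-import this would write to the pack */
}

/* Format result in hand: only store when the content actually changed
 * relative to the previously recorded hash. */
static unsigned int store_tree(const char *formatted, unsigned int old_hash)
{
	unsigned int h = hash_buf(formatted);
	if (h != old_hash)
		store_object(formatted);
	return h;
}
```

An unmodified subtree then costs one hash and one compare instead of a full
store_object() call and object_table lookup.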
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Jonathan Nieder
Date: 2010-11-23 23:17 UTC
To: Shawn Pearce
Cc: git, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

Shawn Pearce wrote:
> On Mon, Nov 22, 2010 at 11:53 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Other aspects to investigate: choice of hash function;
>
> Why?  SHA-1 is pretty uniform in its distribution.

I got distracted for a moment by the atom table, but since that does
not have a big effect on performance it's probably not worth spending
time on.  Sorry about that; please ignore.

[...]
> The way I read this store_tree() code, every subdirectory is recursed
> into even if no modifications were made inside of that subdirectory
> during the current commit.

Doesn't the is_null_sha1 check avoid that?

To further explain the workload: svn-fe receives its blobs from svn
in the form of deltas.  So the conversation might go like this:

 S  commit refs/heads/master
 S  mark :10000
 S  committer felicity <felicity@local>
 S  data 74
 S  bug 3097: switch spamd from doing 'fork per message' to a 'prefork' model
 S  cat incubator/spamassassin/trunk/spamd/spamd.raw
 F  89d56462577b8b7b4f4115f2a47f0b3da22b791a blob 63633
 F  #!/usr/bin/perl -w -T
 ...
 S  M 100644 inline incubator/spamassassin/trunk/spamd/spamd.raw
 S  data 62114
 ...

Current svn-fe in vcs-svn-pu requests the preimage blobs using marks,
but the idea is the same.

If this proves a bottleneck I suppose we could cache the content of
frequently-requested old blobs and keep pointers to that in the
in-core tree.
* Re: [PATCH/RFC] fast-import: insert new object entries at start of hash bucket

From: Shawn Pearce
Date: 2010-11-23 23:29 UTC
To: Jonathan Nieder
Cc: git, David Barr, Nicolas Pitre, Raja R Harinath, Sverre Rabbelier

On Tue, Nov 23, 2010 at 3:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>> The way I read this store_tree() code, every subdirectory is recursed
>> into even if no modifications were made inside of that subdirectory
>> during the current commit.
>
> Doesn't the is_null_sha1 check avoid that?

Oh, right, yes.  That check avoids processing a tree that wasn't
modified.  Damn, that code isn't very clear.  :-(

> To further explain the workload: svn-fe receives its blobs from svn
> in the form of deltas.  So the conversation might go like this:
...
> Current svn-fe in vcs-svn-pu requests the preimage blobs using marks,
> but the idea is the same.
>
> If this proves a bottleneck I suppose we could cache the content of
> frequently-requested old blobs and keep pointers to that in the
> in-core tree.

OK, now I get it.

fast-import already uses sha1_file.c's delta base cache for the delta
bases that are frequently accessed.  It *may* help an importer that
needs to keep refetching the same blob to cache the blob itself rather
than its delta base (and reapply every time).  It might not help too;
it may hurt to try and keep recently accessed objects that the
importer won't need to request again.

-- 
Shawn.